2025-05-07T20:22:34.9461615Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9469256Z Runner name: 'i-0efa96680de6b8d22'
2025-05-07T20:22:34.9470432Z Machine name: 'ip-10-0-51-101'
2025-05-07T20:22:34.9473264Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9475608Z Contents: read
2025-05-07T20:22:34.9476126Z Metadata: read
2025-05-07T20:22:34.9476620Z Packages: read
2025-05-07T20:22:34.9477118Z ##[endgroup]
2025-05-07T20:22:34.9479391Z Secret source: None
2025-05-07T20:22:34.9480571Z Prepare workflow directory
2025-05-07T20:22:35.0005237Z Prepare all required actions
2025-05-07T20:22:35.0043075Z Getting action download info
2025-05-07T20:22:35.2061210Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.4743066Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:35.8140803Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.4082357Z Getting action download info
2025-05-07T20:22:37.5150690Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.7098075Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.9, 12.8.0, 12.6.3, clang)
2025-05-07T20:22:37.7665927Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.7788576Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.7800953Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.7802004Z ##[endgroup]
2025-05-07T20:22:38.9156802Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:38.9157257Z Instance Type: g5.4xlarge
2025-05-07T20:22:38.9157505Z AMI Name: unknown
2025-05-07T20:22:38.9197348Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.3275432Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.3275734Z with:
2025-05-07T20:22:44.3275968Z submodules: true
2025-05-07T20:22:44.3276199Z repository: pytorch/FBGEMM
2025-05-07T20:22:44.3276582Z token: ***
2025-05-07T20:22:44.3276781Z ssh-strict: true
2025-05-07T20:22:44.3276993Z ssh-user: git
2025-05-07T20:22:44.3277215Z persist-credentials: true
2025-05-07T20:22:44.3277460Z clean: true
2025-05-07T20:22:44.3277685Z sparse-checkout-cone-mode: true
2025-05-07T20:22:44.3277953Z fetch-depth: 1
2025-05-07T20:22:44.3278164Z fetch-tags: false
2025-05-07T20:22:44.3278377Z show-progress: true
2025-05-07T20:22:44.3278598Z lfs: false
2025-05-07T20:22:44.3278800Z set-safe-directory: true
2025-05-07T20:22:44.3279055Z env:
2025-05-07T20:22:44.3279262Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.3279567Z BUILD_ENV: build_binary
2025-05-07T20:22:44.3279831Z BUILD_TARGET: genai
2025-05-07T20:22:44.3280050Z BUILD_VARIANT: cuda
2025-05-07T20:22:44.3280307Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:44.3280556Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.3280795Z ##[endgroup]
2025-05-07T20:22:44.4474485Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.4475632Z ##[group]Getting Git version info
2025-05-07T20:22:44.4476080Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4476694Z [command]/usr/bin/git version
2025-05-07T20:22:44.4476955Z git version 2.47.1
2025-05-07T20:22:44.4485353Z ##[endgroup]
2025-05-07T20:22:44.4499429Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/32e5c665-7cf4-4445-941f-b58e67342ba4' before making global git config changes
2025-05-07T20:22:44.4500337Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.4513869Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.4552860Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4555737Z ##[group]Initializing the repository
2025-05-07T20:22:44.4560046Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.4602656Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.4603409Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.4603950Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.4604334Z hint:
2025-05-07T20:22:44.4604622Z hint: 	git config --global init.defaultBranch <name>
2025-05-07T20:22:44.4604957Z hint:
2025-05-07T20:22:44.4605275Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.4605822Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.4606237Z hint:
2025-05-07T20:22:44.4606452Z hint: 	git branch -m <name>
2025-05-07T20:22:44.4606949Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.4614749Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.4648741Z ##[endgroup]
2025-05-07T20:22:44.4649228Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.4652521Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.4683598Z ##[endgroup]
2025-05-07T20:22:44.4684014Z ##[group]Setting up auth
2025-05-07T20:22:44.4690013Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.4721363Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.5084009Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.5115728Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.5461121Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.5511348Z ##[endgroup]
2025-05-07T20:22:44.5511772Z ##[group]Fetching the repository
2025-05-07T20:22:44.5519975Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3135755Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3136447Z  * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3162246Z ##[endgroup]
2025-05-07T20:22:45.3162633Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3165163Z ##[endgroup]
2025-05-07T20:22:45.3179797Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.3219570Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
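The extraheader dance above is how actions/checkout scopes the token to github.com without ever writing it into the remote URL. A minimal sketch of the same pattern; GH_TOKEN and the x-access-token framing are assumptions about how the action builds the header, not values visible in this log (the real value is masked as '***'):

    # GH_TOKEN is a hypothetical placeholder for the workflow token.
    AUTH_B64=$(printf 'x-access-token:%s' "${GH_TOKEN}" | base64 | tr -d '\n')
    # Scope the header to github.com only, in the local repo config.
    git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic ${AUTH_B64}"
    # Cleanup later is a single unset, as in the submodule foreach above.
    git config --local --unset-all http.https://github.com/.extraheader

Because the credential lives in config rather than the URL, it never shows up in `git remote -v` output.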
2025-05-07T20:22:45.3250344Z ##[group]Checking out the ref
2025-05-07T20:22:45.3254697Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.4348571Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.4348854Z
2025-05-07T20:22:45.4349139Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.4349768Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.4350463Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.4350793Z
2025-05-07T20:22:45.4351014Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.4351505Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.4351785Z
2025-05-07T20:22:45.4351904Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.4352109Z
2025-05-07T20:22:45.4352235Z Or undo this operation with:
2025-05-07T20:22:45.4352417Z
2025-05-07T20:22:45.4352523Z   git switch -
2025-05-07T20:22:45.4352959Z
2025-05-07T20:22:45.4353202Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.4353556Z
2025-05-07T20:22:45.4353971Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.4364038Z ##[endgroup]
2025-05-07T20:22:45.4364474Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.4370799Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.4422979Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.4455497Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.4487834Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.4515717Z ##[endgroup]
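The two insteadOf rewrites registered above route SSH-style submodule URLs through HTTPS, so the extraheader token covers every submodule fetch. A small sketch of the mechanism (the example clone URL is illustrative, not one this job clones via SSH):

    # After this rewrite, SSH-style GitHub URLs are fetched over HTTPS,
    # which is where the AUTHORIZATION header applies.
    git config --global url.https://github.com/.insteadOf git@github.com:
    # This now transparently clones https://github.com/pytorch/cpuinfo
    git clone git@github.com:pytorch/cpuinfo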
2025-05-07T20:22:45.4516100Z ##[group]Fetching submodules
2025-05-07T20:22:45.4518988Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.4861930Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5194949Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5198080Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5201722Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5205875Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5210044Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5214410Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5218201Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5249575Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.8713772Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.3279272Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.7141112Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.7129567Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.0503845Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.2944684Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.4688276Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.4688829Z  * branch e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.5168297Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.1628686Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.1629185Z  * branch 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:50.4328164Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.4403327Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.4403772Z  * branch 6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.5458187Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.5966120Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.5966593Z  * branch 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.2929084Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.0626567Z From https://github.com/google/googletest
2025-05-07T20:22:54.0627026Z  * branch f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.1042745Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:54.7291567Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:54.7292231Z  * branch 420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:54.7374366Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.4578239Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.4578843Z  * branch 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:55.5716112Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:55.5735834Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:55.6066853Z Entering 'external/asmjit'
2025-05-07T20:22:55.6099493Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.6131478Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.6163373Z Entering 'external/cutlass'
2025-05-07T20:22:55.6195618Z Entering 'external/googletest'
2025-05-07T20:22:55.6227739Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.6259334Z Entering 'external/json'
2025-05-07T20:22:55.6303574Z ##[endgroup]
2025-05-07T20:22:55.6304065Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:55.6310237Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:55.6638120Z Entering 'external/asmjit'
2025-05-07T20:22:55.6705581Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.6775960Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.6842486Z Entering 'external/cutlass'
2025-05-07T20:22:55.6916272Z Entering 'external/googletest'
2025-05-07T20:22:55.6983262Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.7051780Z Entering 'external/json'
2025-05-07T20:22:55.7135852Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:55.7463198Z Entering 'external/asmjit'
2025-05-07T20:22:55.7525718Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:55.7527961Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.7590334Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:55.7593555Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.7654410Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:55.7658150Z Entering 'external/cutlass'
2025-05-07T20:22:55.7717335Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:55.7720283Z Entering 'external/googletest'
2025-05-07T20:22:55.7779861Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:55.7783506Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.7843156Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:55.7846268Z Entering 'external/json'
2025-05-07T20:22:55.7908962Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:55.8002545Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:55.8333932Z Entering 'external/asmjit'
2025-05-07T20:22:55.8365923Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8397686Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8429075Z Entering 'external/cutlass'
2025-05-07T20:22:55.8460557Z Entering 'external/googletest'
2025-05-07T20:22:55.8492752Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.8524069Z Entering 'external/json'
2025-05-07T20:22:55.8571369Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:55.8903002Z Entering 'external/asmjit'
2025-05-07T20:22:55.8933574Z Entering 'external/composable_kernel'
2025-05-07T20:22:55.8964889Z Entering 'external/cpuinfo'
2025-05-07T20:22:55.8996766Z Entering 'external/cutlass'
2025-05-07T20:22:55.9028068Z Entering 'external/googletest'
2025-05-07T20:22:55.9060046Z Entering 'external/hipify_torch'
2025-05-07T20:22:55.9091239Z Entering 'external/json'
2025-05-07T20:22:55.9152694Z ##[endgroup]
2025-05-07T20:22:55.9177170Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:55.9203927Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
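Each "checked out '...'" line above pins a submodule to the exact SHA recorded in the superproject. A quick way to confirm the pins after the fact, using standard git rather than anything this job runs:

    # A leading space per line means the working tree matches the committed
    # gitlink; a '+' would flag a submodule sitting on a different commit.
    git submodule status --recursive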
2025-05-07T20:22:55.9392190Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:55.9392502Z with:
2025-05-07T20:22:55.9392748Z name: fbgemm_genai_x86_clang_py3.9_cu12.8.0.whl
2025-05-07T20:22:55.9393072Z merge-multiple: false
2025-05-07T20:22:55.9393327Z repository: pytorch/FBGEMM
2025-05-07T20:22:55.9393586Z run-id: 14891846252
2025-05-07T20:22:55.9393790Z env:
2025-05-07T20:22:55.9394015Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:55.9394310Z BUILD_ENV: build_binary
2025-05-07T20:22:55.9394549Z BUILD_TARGET: genai
2025-05-07T20:22:55.9394761Z BUILD_VARIANT: cuda
2025-05-07T20:22:55.9394990Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:55.9395235Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:55.9395475Z ##[endgroup]
2025-05-07T20:22:56.1749831Z Downloading single artifact
2025-05-07T20:22:56.2606384Z Preparing to download the following artifacts:
2025-05-07T20:22:56.2607284Z - fbgemm_genai_x86_clang_py3.9_cu12.8.0.whl (ID: 3081405239, Size: 18501145, Expected Digest: sha256:49d17600359b05f780104ac5b5c7182a7fffa14a07ce833b6d20dd778f161f31)
2025-05-07T20:22:56.3137266Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-c9082d77-4e4e-5fd7-9873-085c291b0b68/artifacts/dcc1e5fc208536aec3c652f2daa6cf51fe28cc42d2657b9bbfc350fdc93bbce4.zip
2025-05-07T20:22:56.3138861Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.4301122Z (node:57020) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.4302162Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:56.7200718Z SHA256 digest of downloaded artifact is 49d17600359b05f780104ac5b5c7182a7fffa14a07ce833b6d20dd778f161f31
2025-05-07T20:22:56.7201380Z Artifact download completed successfully.
2025-05-07T20:22:56.7201720Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:56.7207024Z Download artifact has finished successfully
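download-artifact compares the downloaded zip against the digest the service advertised, which is why the two SHA256 lines above match. The same check by hand, assuming the artifact were saved as artifact.zip (a placeholder name; the digest is the one from this log):

    EXPECTED=49d17600359b05f780104ac5b5c7182a7fffa14a07ce833b6d20dd778f161f31
    # sha256sum --check expects "<hash>  <file>" with two spaces.
    echo "${EXPECTED}  artifact.zip" | sha256sum --check -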
2025-05-07T20:22:56.7532218Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:56.7532838Z with:
2025-05-07T20:22:56.7533159Z driver-version: 570.133.07
2025-05-07T20:22:56.7533554Z env:
2025-05-07T20:22:56.7533881Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.7534355Z BUILD_ENV: build_binary
2025-05-07T20:22:56.7534732Z BUILD_TARGET: genai
2025-05-07T20:22:56.7535073Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.7535432Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:56.7535835Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.7536202Z ##[endgroup]
2025-05-07T20:22:56.7638534Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:56.7638936Z with:
2025-05-07T20:22:56.7639337Z timeout_minutes: 10
2025-05-07T20:22:56.7639573Z max_attempts: 3
2025-05-07T20:22:56.7665177Z command: # Is it disgusting to have a full shell script here in this github action? Sure
  # But is it the best way to make it so that this action relies on nothing else? Absolutely
  set -eou pipefail

  DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
  DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

  install_nvidia_docker2_amzn2() {
    (
      set -x
      # Needed for yum-config-manager
      sudo yum install -y yum-utils
      if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
        YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
      else
        # Amazon Linux 2
        YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
      fi
      sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
      sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    )
  }

  install_nvidia_docker2_ubuntu20() {
    (
      set -x
      # Install nvidia-driver package if not installed
      status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
      if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
        sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      fi
    )
  }

  pre_install_nvidia_driver_amzn2() {
    (
      # Purge any nvidia driver installed from RHEL repo
      sudo yum remove -y nvidia-driver-latest-dkms
    )
  }

  install_nvidia_driver_common() {
    (
      # Try to gather more information about the runner and its existing NVIDIA driver if any
      echo "Before installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      HAS_NVIDIA_DRIVER=0
      # Check if NVIDIA driver has already been installed
      if [ -x "$(command -v nvidia-smi)" ]; then
        set +e
        # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
        # so that the same driver version is not printed over multiple lines
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
        elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
          # Turn off persistent mode so that the installation script can unload the kernel module
          sudo killall nvidia-persistenced || true
        else
          HAS_NVIDIA_DRIVER=1
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
        fi
        set -e
      fi

      if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
        # CAUTION: this may need to be updated in future
        if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
          sudo yum groupinstall -y "Development Tools"
          # ensure our kernel install is the same as our underlying kernel,
          # groupinstall "Development Tools" has a habit of mismatching kernel headers
          sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
          sudo modprobe backlight
        fi
        sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

        set +e
        sudo /bin/bash /tmp/nvidia_driver -s --no-drm
        NVIDIA_INSTALLATION_STATUS=$?

        RESET_GPU=0
        if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
          sudo cat /var/log/nvidia-installer.log
          # Failed to install the NVIDIA driver, try to reset the GPU
          RESET_GPU=1
        elif [ -x "$(command -v nvidia-smi)" ]; then
          # Check again if nvidia-smi works even if the driver installation completes successfully
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            RESET_GPU=1
          fi
        fi

        if [ "$RESET_GPU" -eq 1 ]; then
          NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
          # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
          # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
          for PCI_ID in $NVIDIA_DEVICES; do
            DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
            echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
            # This requires sudo permission of course
            echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
            sleep 1
          done
        fi

        sudo rm -fv /tmp/nvidia_driver
        set -e
      fi
    )
  }

  post_install_nvidia_driver_common() {
    (
      sudo modprobe nvidia || true
      echo "After installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      (
        set +e
        nvidia-smi
        # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
        # the case where the driver has already crashed as it still can get the driver version
        # and some basic information like the bus ID. However, the rest of the information
        # would be missing (ERR!), for example:
        #
        # +-----------------------------------------------------------------------------+
        # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
        # |-------------------------------+----------------------+----------------------+
        # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
        # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
        # |                               |                      |               MIG M. |
        # |===============================+======================+======================|
        # |   0  ERR!                 Off | 00000000:00:1E.0 Off |                 ERR! |
        # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |      ERR!    Default |
        # |                               |                      |                 ERR! |
        # +-------------------------------+----------------------+----------------------+
        #
        # +-----------------------------------------------------------------------------+
        # | Processes:                                                                  |
        # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
        # |        ID   ID                                                   Usage      |
        # |=============================================================================|
        # +-----------------------------------------------------------------------------+
        #
        # This should be reported as a failure instead as it will guarantee to fail when
        # Docker tries to run with --gpus all
        #
        # So, the correct check here is to query one of the missing pieces of info like
        # GPU name, so that the command can fail accordingly
        nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
        NVIDIA_SMI_STATUS=$?

        # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
        if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
          echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
        else
          echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
          exit ${NVIDIA_SMI_STATUS}
        fi
        set -e
      )
    )
  }

  install_nvidia_driver_amzn2() {
    (
      set -x
      pre_install_nvidia_driver_amzn2
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  install_nvidia_driver_ubuntu20() {
    (
      set -x
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  echo "== Installing nvidia driver ${DRIVER_FN} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_driver_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_driver_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  # Install container toolkit based on distribution
  echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_docker2_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_docker2_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

  # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
  # more than one GPU. This just needs to be run once. The command fails
  # on subsequent runs and complains that the mode is already on, but that's
  # ok
  sudo nvidia-persistenced || true

  # This should show persistence mode ON
  nvidia-smi
2025-05-07T20:22:56.7691437Z retry_wait_seconds: 10
2025-05-07T20:22:56.7691702Z polling_interval_seconds: 1
2025-05-07T20:22:56.7691968Z warning_on_retry: true
2025-05-07T20:22:56.7692225Z continue_on_error: false
2025-05-07T20:22:56.7692476Z env:
2025-05-07T20:22:56.7692705Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.7693021Z BUILD_ENV: build_binary
2025-05-07T20:22:56.7693275Z BUILD_TARGET: genai
2025-05-07T20:22:56.7693509Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.7693761Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:56.7694030Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.7694281Z DRIVER_VERSION: 570.133.07
2025-05-07T20:22:56.7694541Z ##[endgroup]
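The heart of install_nvidia_driver_common is the version probe that decides whether installation can be skipped, which is exactly the branch this run takes below. A standalone sketch of that probe, with DRIVER_VERSION hard-coded here purely for illustration:

    DRIVER_VERSION=570.133.07
    if [ -x "$(command -v nvidia-smi)" ]; then
        # Query only GPU 0 so multi-GPU hosts report a single version string.
        INSTALLED=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        STATUS=$?
        # Exit status 14 is tolerated, per https://github.com/NVIDIA/gpu-operator/issues/285
        if { [ "$STATUS" -eq 0 ] || [ "$STATUS" -eq 14 ]; } && [ "$INSTALLED" = "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED) already installed, skipping"
        fi
    fi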
2025-05-07T20:22:56.8514162Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:56.8514909Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:56.8518527Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.4361275Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.4361669Z No packages marked for removal.
2025-05-07T20:22:57.4423804Z Dependencies resolved.
2025-05-07T20:22:57.4437653Z Nothing to do.
2025-05-07T20:22:57.4437884Z Complete!
2025-05-07T20:22:57.4758564Z + install_nvidia_driver_common
2025-05-07T20:22:57.4763080Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.4763450Z + lspci
2025-05-07T20:22:57.4764105Z Before installing NVIDIA driver
2025-05-07T20:22:57.4965368Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.4966222Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.4966809Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.4967346Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.4968060Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.4968654Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.4969152Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.4969648Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.4970070Z + lsmod
2025-05-07T20:22:57.5011514Z Module                  Size  Used by
2025-05-07T20:22:57.5011809Z xt_conntrack           16384  1
2025-05-07T20:22:57.5012084Z nft_chain_nat          16384  3
2025-05-07T20:22:57.5012350Z xt_MASQUERADE          20480  1
2025-05-07T20:22:57.5012660Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.5013005Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:57.5013418Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.5013866Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:57.5014225Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:57.5014526Z xfrm_user              57344  1
2025-05-07T20:22:57.5014821Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:57.5015134Z xt_addrtype            16384  2
2025-05-07T20:22:57.5015397Z nft_compat             20480  4
2025-05-07T20:22:57.5015707Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.5016128Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.5016530Z br_netfilter           36864  0
2025-05-07T20:22:57.5016807Z bridge                323584  1 br_netfilter
2025-05-07T20:22:57.5017120Z stp                    16384  1 bridge
2025-05-07T20:22:57.5017411Z llc                    16384  2 bridge,stp
2025-05-07T20:22:57.5017703Z overlay               167936  0
2025-05-07T20:22:57.5017956Z tls                   135168  0
2025-05-07T20:22:57.5018199Z nls_ascii              16384  1
2025-05-07T20:22:57.5018453Z nls_cp437              20480  1
2025-05-07T20:22:57.5018704Z vfat                   24576  1
2025-05-07T20:22:57.5018956Z fat                    86016  1 vfat
2025-05-07T20:22:57.5019230Z sunrpc                696320  1
2025-05-07T20:22:57.5019480Z ena                   180224  0
2025-05-07T20:22:57.5019721Z i8042                  45056  0
2025-05-07T20:22:57.5019975Z serio                  28672  3 i8042
2025-05-07T20:22:57.5020256Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:57.5020523Z button                 24576  0
2025-05-07T20:22:57.5020779Z sch_fq_codel           20480  17
2025-05-07T20:22:57.5021039Z dm_mod                188416  0
2025-05-07T20:22:57.5021293Z loop                   36864  0
2025-05-07T20:22:57.5021539Z fuse                  163840  1
2025-05-07T20:22:57.5021790Z configfs               57344  1
2025-05-07T20:22:57.5022050Z dax                    45056  1 dm_mod
2025-05-07T20:22:57.5022322Z dmi_sysfs              20480  0
2025-05-07T20:22:57.5022580Z crc32_pclmul           16384  0
2025-05-07T20:22:57.5022838Z crc32c_intel           24576  0
2025-05-07T20:22:57.5023088Z efivarfs               24576  1
2025-05-07T20:22:57.5023345Z + modinfo nvidia
2025-05-07T20:22:57.5031240Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.5031739Z import_ns:      DMA_BUF
2025-05-07T20:22:57.5031984Z alias:          char-major-195-*
2025-05-07T20:22:57.5032263Z version:        570.133.07
2025-05-07T20:22:57.5032518Z supported:      external
2025-05-07T20:22:57.5032925Z license:        Dual MIT/GPL
2025-05-07T20:22:57.5033234Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.5033581Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.5034312Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:57.5034640Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.5034991Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.5035334Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.5035655Z depends:        i2c-core,drm
2025-05-07T20:22:57.5035908Z retpoline:      Y
2025-05-07T20:22:57.5036126Z name:           nvidia
2025-05-07T20:22:57.5036494Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.5036976Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.5037447Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.5038000Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.5038311Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:57.5038626Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.5038950Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:57.5039253Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:57.5039570Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:57.5039937Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.5040333Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.5040672Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.5040980Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:57.5041285Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.5041646Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.5042046Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.5042431Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.5042849Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.5043265Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.5043698Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.5044126Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.5044468Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.5044897Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.5045281Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.5045619Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.5045947Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.5046284Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.5046606Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.5046921Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:57.5047279Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.5047653Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.5047977Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:57.5048322Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.5048683Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.5049020Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:57.5049370Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.5049712Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:57.5050001Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.5050335Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.5050677Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.5050990Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.5051325Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.5051694Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.5052047Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:57.5052377Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.5052731Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.5053083Z parm:           rm_firmware_active:charp
2025-05-07T20:22:57.5053477Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.5053729Z ++ command -v nvidia-smi
2025-05-07T20:22:57.5053996Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.5054256Z + set +e
2025-05-07T20:22:57.5054576Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.3286367Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.3286745Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.3287061Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.3287366Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.3287753Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.3288374Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.3289041Z + set -e
2025-05-07T20:22:59.3289787Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.3290347Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.3290904Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.3294668Z + sudo modprobe nvidia
2025-05-07T20:22:59.4778632Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.4778954Z + lspci
2025-05-07T20:22:59.4779427Z After installing NVIDIA driver
2025-05-07T20:22:59.4899513Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.4900031Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.4900607Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.4901155Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.4901647Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.4902198Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.4902719Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.4903209Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.4903634Z + lsmod
2025-05-07T20:22:59.4931408Z Module                  Size  Used by
2025-05-07T20:22:59.4931706Z nvidia_uvm           1884160  0
2025-05-07T20:22:59.4932147Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:22:59.4932478Z drm                   602112  1 nvidia
2025-05-07T20:22:59.4932792Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:22:59.4933115Z backlight              24576  1 drm
2025-05-07T20:22:59.4933409Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:22:59.4933711Z xt_conntrack           16384  1
2025-05-07T20:22:59.4933967Z nft_chain_nat          16384  3
2025-05-07T20:22:59.4934230Z xt_MASQUERADE          20480  1
2025-05-07T20:22:59.4934549Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.4934892Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:59.4935299Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.4935755Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:59.4936080Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:59.4936377Z xfrm_user              57344  1
2025-05-07T20:22:59.4936648Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:59.4936953Z xt_addrtype            16384  2
2025-05-07T20:22:59.4937207Z nft_compat             20480  4
2025-05-07T20:22:59.4937513Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.4937943Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.4938326Z br_netfilter           36864  0
2025-05-07T20:22:59.4938602Z bridge                323584  1 br_netfilter
2025-05-07T20:22:59.4938902Z stp                    16384  1 bridge
2025-05-07T20:22:59.4939184Z llc                    16384  2 bridge,stp
2025-05-07T20:22:59.4939478Z overlay               167936  0
2025-05-07T20:22:59.4939729Z tls                   135168  0
2025-05-07T20:22:59.4939984Z nls_ascii              16384  1
2025-05-07T20:22:59.4940523Z nls_cp437              20480  1
2025-05-07T20:22:59.4940777Z vfat                   24576  1
2025-05-07T20:22:59.4941036Z fat                    86016  1 vfat
2025-05-07T20:22:59.4941296Z sunrpc                696320  1
2025-05-07T20:22:59.4941546Z ena                   180224  0
2025-05-07T20:22:59.4941789Z i8042                  45056  0
2025-05-07T20:22:59.4942034Z serio                  28672  3 i8042
2025-05-07T20:22:59.4942315Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:59.4942570Z button                 24576  0
2025-05-07T20:22:59.4942818Z sch_fq_codel           20480  17
2025-05-07T20:22:59.4943077Z dm_mod                188416  0
2025-05-07T20:22:59.4943324Z loop                   36864  0
2025-05-07T20:22:59.4943561Z fuse                  163840  1
2025-05-07T20:22:59.4943942Z configfs               57344  1
2025-05-07T20:22:59.4944199Z dax                    45056  1 dm_mod
2025-05-07T20:22:59.4944475Z dmi_sysfs              20480  0
2025-05-07T20:22:59.4944721Z crc32_pclmul           16384  0
2025-05-07T20:22:59.4944980Z crc32c_intel           24576  0
2025-05-07T20:22:59.4945233Z efivarfs               24576  1
2025-05-07T20:22:59.4945474Z + modinfo nvidia
2025-05-07T20:22:59.4949536Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.4950159Z import_ns:      DMA_BUF
2025-05-07T20:22:59.4950401Z alias:          char-major-195-*
2025-05-07T20:22:59.4950660Z version:        570.133.07
2025-05-07T20:22:59.4950899Z supported:      external
2025-05-07T20:22:59.4951142Z license:        Dual MIT/GPL
2025-05-07T20:22:59.4951423Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.4951767Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.4952090Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:59.4952411Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.4952755Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.4953094Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.4953408Z depends:        i2c-core,drm
2025-05-07T20:22:59.4953664Z retpoline:      Y
2025-05-07T20:22:59.4953875Z name:           nvidia
2025-05-07T20:22:59.4954242Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.4954723Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.4955180Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.4955609Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.4955917Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:59.4956217Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.4956533Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:59.4956829Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:59.4957141Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:59.4957506Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.4957902Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.4958231Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.4958539Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:59.4958846Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.4959205Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.4959606Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.4959994Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.4960407Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.4960818Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.4961239Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.4961656Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.4961987Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.4962357Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.4962841Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.4963182Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.4963507Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.4963842Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.4964160Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.4964473Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:59.4964823Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.4965180Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.4965512Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:59.4965846Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.4966189Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.4966616Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:59.4966954Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.4967287Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:59.4967576Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.4967899Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.4968225Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.4968535Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.4968864Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.4969227Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.4969578Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:59.4969898Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.4970244Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.4970584Z parm:           rm_firmware_active:charp
2025-05-07T20:22:59.4970865Z + set +e
2025-05-07T20:22:59.4971060Z + nvidia-smi
2025-05-07T20:23:00.9061907Z Wed May 7 20:23:00 2025
2025-05-07T20:23:00.9062321Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9062882Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:00.9063399Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9063918Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:00.9064465Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:00.9064918Z |                                         |                        |               MIG M. |
2025-05-07T20:23:00.9065271Z |=========================================+========================+======================|
2025-05-07T20:23:00.9127124Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:00.9127635Z |  0%   29C    P0             64W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:00.9128061Z |                                         |                        |                  N/A |
2025-05-07T20:23:00.9128485Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:00.9128903Z
2025-05-07T20:23:00.9129317Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:00.9129773Z | Processes:                                                                              |
2025-05-07T20:23:00.9130240Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:00.9130670Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:00.9131030Z |=========================================================================================|
2025-05-07T20:23:00.9132226Z |  No running processes found                                                             |
2025-05-07T20:23:00.9132985Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.3205590Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:02.7255230Z NVIDIA A10G
2025-05-07T20:23:02.9933279Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:02.9933541Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:02.9933791Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:02.9934090Z + set -e
2025-05-07T20:23:02.9934309Z INFO: Ignoring allowed status 0
2025-05-07T20:23:02.9942878Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:02.9947946Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.4110418Z Last metadata expiration check: 0:07:09 ago on Wed May 7 20:15:54 2025.
2025-05-07T20:23:03.4359435Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.4756919Z Dependencies resolved.
2025-05-07T20:23:03.4938021Z Nothing to do.
2025-05-07T20:23:03.4938770Z Complete!
2025-05-07T20:23:03.5327224Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.5327822Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.5328731Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.8784717Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.9338723Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.5199855Z nvidia-container-toolkit                         14 kB/s | 833 B     00:00
2025-05-07T20:23:04.5448887Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.5846633Z Dependencies resolved.
2025-05-07T20:23:04.6025395Z ================================================================================
2025-05-07T20:23:04.6026319Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:04.6026762Z ================================================================================
2025-05-07T20:23:04.6027068Z Downgrading:
2025-05-07T20:23:04.6027442Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.6028059Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.6028432Z
2025-05-07T20:23:04.6028533Z Transaction Summary
2025-05-07T20:23:04.6028774Z ================================================================================
2025-05-07T20:23:04.6029092Z Downgrade  2 Packages
2025-05-07T20:23:04.6029240Z
2025-05-07T20:23:04.6029348Z Total download size: 6.8 M
2025-05-07T20:23:04.6030226Z Downloading Packages:
2025-05-07T20:23:04.6545148Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  24 MB/s | 1.2 MB     00:00
2025-05-07T20:23:04.6962189Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  61 MB/s | 5.6 MB     00:00
2025-05-07T20:23:04.6970725Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.6973594Z Total                                            73 MB/s | 6.8 MB     00:00
2025-05-07T20:23:04.6976055Z Running transaction check
2025-05-07T20:23:04.7081708Z Transaction check succeeded.
2025-05-07T20:23:04.7082096Z Running transaction test
2025-05-07T20:23:04.7376602Z Transaction test succeeded.
2025-05-07T20:23:04.7378447Z Running transaction
2025-05-07T20:23:05.2907612Z   Preparing        :                                                       1/1
2025-05-07T20:23:05.3964480Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64         1/4
2025-05-07T20:23:05.3990124Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64              2/4
2025-05-07T20:23:05.4201828Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64              2/4
2025-05-07T20:23:05.4202536Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64              3/4
2025-05-07T20:23:05.4305739Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64              3/4
2025-05-07T20:23:05.4334365Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64         4/4
2025-05-07T20:23:06.8404870Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64              4/4
2025-05-07T20:23:06.8405816Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64              1/4
2025-05-07T20:23:06.8406683Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64              2/4
2025-05-07T20:23:06.8407568Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64         3/4
2025-05-07T20:23:06.9715291Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64         4/4================================================================================
2025-05-07T20:23:06.9716132Z WARNING:
2025-05-07T20:23:06.9716371Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:06.9716630Z
2025-05-07T20:23:06.9716730Z   Available Versions:
2025-05-07T20:23:06.9716901Z
2025-05-07T20:23:06.9717014Z   Version 2023.7.20250331:
2025-05-07T20:23:06.9717320Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:06.9717590Z
2025-05-07T20:23:06.9717718Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:06.9717940Z
2025-05-07T20:23:06.9718023Z     Release notes:
2025-05-07T20:23:06.9718444Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:06.9718832Z
2025-05-07T20:23:06.9718920Z   Version 2023.7.20250414:
2025-05-07T20:23:06.9719239Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:06.9719499Z
2025-05-07T20:23:06.9719625Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:06.9719842Z
2025-05-07T20:23:06.9719997Z     Release notes:
2025-05-07T20:23:06.9720658Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:06.9721074Z
2025-05-07T20:23:06.9721256Z   Version 2023.7.20250428:
2025-05-07T20:23:06.9721628Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:06.9721992Z
2025-05-07T20:23:06.9722151Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:06.9722395Z
2025-05-07T20:23:06.9722601Z     Release notes:
2025-05-07T20:23:06.9723084Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:06.9723547Z
2025-05-07T20:23:06.9734704Z ================================================================================
2025-05-07T20:23:07.0080969Z
2025-05-07T20:23:07.0081175Z
2025-05-07T20:23:07.0081312Z Downgraded:
2025-05-07T20:23:07.0081706Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.0082313Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.0082696Z
2025-05-07T20:23:07.0082908Z Complete!
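Note that requesting nvidia-container-toolkit-1.16.2 by name-version made dnf resolve a downgrade from the 1.17.6 already on the AMI, as the transaction above shows. Verifying the pin afterwards is a one-liner; the expected output is an inference from the transaction, not a line from this log:

    # Both should report 1.16.2-1 after the downgrade, not 1.17.6.
    rpm -q nvidia-container-toolkit nvidia-container-toolkit-base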
2025-05-07T20:23:07.0530792Z + sudo systemctl restart docker
2025-05-07T20:23:11.2009166Z Wed May 7 20:23:11 2025
2025-05-07T20:23:11.2009615Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.2010157Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:11.2010673Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.2011200Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:11.2011760Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:11.2012221Z |                                         |                        |               MIG M. |
2025-05-07T20:23:11.2012672Z |=========================================+========================+======================|
2025-05-07T20:23:11.2091597Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:11.2092421Z |  0%   30C    P0             64W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:11.2092838Z |                                         |                        |                  N/A |
2025-05-07T20:23:11.2093245Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.2093658Z
2025-05-07T20:23:11.2094061Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.2094502Z | Processes:                                                                              |
2025-05-07T20:23:11.2094959Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:11.2095532Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:11.2095888Z |=========================================================================================|
2025-05-07T20:23:11.2097193Z |  No running processes found                                                             |
2025-05-07T20:23:11.2097687Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.8255575Z Command completed after 1 attempt(s).
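The GPU_FLAG written to GITHUB_ENV by the script is what later steps splice into their docker invocations; note it now appears in the env block of the next step below. A hypothetical consumer, not part of this job, might look like:

    # GPU_FLAG expands to: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
    # Word-splitting is intended here, hence the unquoted expansion.
    docker run --rm ${GPU_FLAG} ubuntu:22.04 nvidia-smi

With the container toolkit installed, --gpus triggers the NVIDIA runtime hook, which is what the downgraded nvidia-container-toolkit packages above provide.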
2025-05-07T20:23:11.8340965Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.8341455Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:11.8356312Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:11.8356683Z env:
2025-05-07T20:23:11.8356913Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:11.8357213Z BUILD_ENV: build_binary
2025-05-07T20:23:11.8357459Z BUILD_TARGET: genai
2025-05-07T20:23:11.8357695Z BUILD_VARIANT: cuda
2025-05-07T20:23:11.8357945Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:11.8358229Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:11.8358534Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:11.8358868Z ##[endgroup]
2025-05-07T20:23:12.1703817Z ################################################################################
2025-05-07T20:23:12.1704170Z # Print System Info
2025-05-07T20:23:12.1704382Z #
2025-05-07T20:23:12.1720170Z # [2025-05-07T20:23:12.171Z] + print_system_info
2025-05-07T20:23:12.1720519Z ################################################################################
2025-05-07T20:23:12.1720769Z
2025-05-07T20:23:12.1720881Z ################################################################################
2025-05-07T20:23:12.1721217Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.1721516Z + printenv
2025-05-07T20:23:12.1721627Z
2025-05-07T20:23:12.1743383Z SHELL=/bin/bash
2025-05-07T20:23:12.1743742Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.1744302Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.1744940Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_c71ea148-c953-4dc1-a8bb-b70fcbecd39b
2025-05-07T20:23:12.1745653Z GITHUB_ACTION=__run
2025-05-07T20:23:12.1745949Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.1746299Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.1746545Z RUNNER_NAME=i-0efa96680de6b8d22
2025-05-07T20:23:12.1746834Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.1747148Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.1747409Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.1747787Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.1748241Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.1748519Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.1748807Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.1749350Z ***
2025-05-07T20:23:12.1749557Z LOGNAME=ec2-user
2025-05-07T20:23:12.1749790Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.1750205Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.1750437Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.1750658Z SYSTEMD_EXEC_PID=55516
2025-05-07T20:23:12.1750934Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.1751503Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.1752035Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.1752311Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.1752575Z RUNNER_OS=Linux
2025-05-07T20:23:12.1752799Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.1753044Z HOME=/home/ec2-user
2025-05-07T20:23:12.1753290Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.1753587Z LANG=C.UTF-8
2025-05-07T20:23:12.1753881Z RUNNER_TRACKING_ID=github_f62a49ac-39e7-4f59-b3ca-31d00a76a701
2025-05-07T20:23:12.1754247Z RUNNER_ARCH=X64
2025-05-07T20:23:12.1754512Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.1755199Z BUILD_TARGET=genai
2025-05-07T20:23:12.1755753Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_c71ea148-c953-4dc1-a8bb-b70fcbecd39b
2025-05-07T20:23:12.1756678Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_c71ea148-c953-4dc1-a8bb-b70fcbecd39b
2025-05-07T20:23:12.1757452Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.1758156Z INVOCATION_ID=aab5aa5f0aac458c98e693b092c8fb0e
2025-05-07T20:23:12.1758491Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.1758749Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.1759361Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_c71ea148-c953-4dc1-a8bb-b70fcbecd39b
2025-05-07T20:23:12.1760058Z BUILD_ENV=build_binary
2025-05-07T20:23:12.1760288Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.1760498Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.1760720Z KERN_NAME_LC=linux
2025-05-07T20:23:12.1760948Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:12.1761240Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.1761586Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.1761838Z USER=ec2-user
2025-05-07T20:23:12.1762063Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.1762344Z SHLVL=1 2025-05-07T20:23:12.1762531Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.1762841Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.1763298Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.1763675Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.1763905Z KERN_NAME=Linux 2025-05-07T20:23:12.1764132Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.1764547Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.1764990Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.1765262Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.1765511Z JOURNAL_STREAM=8:82613 2025-05-07T20:23:12.1765834Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.1766203Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.1766513Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.1766853Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.1767064Z CI=true 2025-05-07T20:23:12.1767266Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.1767549Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.1767826Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.1768076Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.1768718Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_c71ea148-c953-4dc1-a8bb-b70fcbecd39b 2025-05-07T20:23:12.1769333Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.1769554Z _=/usr/bin/printenv 2025-05-07T20:23:12.1769693Z 2025-05-07T20:23:12.1769809Z ################################################################################ 2025-05-07T20:23:12.1770135Z [INFO] Print ldd version ... 2025-05-07T20:23:12.1770388Z + ldd --version 2025-05-07T20:23:12.1770521Z 2025-05-07T20:23:12.1770621Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.1770885Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.1771334Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.1771887Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.1772346Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.1772569Z 2025-05-07T20:23:12.1772689Z ################################################################################ 2025-05-07T20:23:12.1773000Z [INFO] Print CPU info ... 
2025-05-07T20:23:12.1773237Z + nproc 2025-05-07T20:23:12.1773343Z 2025-05-07T20:23:12.1789722Z 16 2025-05-07T20:23:12.1791662Z 2025-05-07T20:23:12.1791903Z + lscpu 2025-05-07T20:23:12.1792012Z 2025-05-07T20:23:12.1905591Z Architecture: x86_64 2025-05-07T20:23:12.1906097Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.1906790Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1907194Z Byte Order: Little Endian 2025-05-07T20:23:12.1907523Z CPU(s): 16 2025-05-07T20:23:12.1907820Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.1908148Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.1908496Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.1908809Z CPU family: 23 2025-05-07T20:23:12.1909253Z Model: 49 2025-05-07T20:23:12.1909551Z Thread(s) per core: 2 2025-05-07T20:23:12.1909836Z Core(s) per socket: 8 2025-05-07T20:23:12.1910287Z Socket(s): 1 2025-05-07T20:23:12.1910576Z Stepping: 0 2025-05-07T20:23:12.1910885Z BogoMIPS: 5600.00 2025-05-07T20:23:12.1913153Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1915442Z Hypervisor vendor: KVM 2025-05-07T20:23:12.1915751Z Virtualization type: full 2025-05-07T20:23:12.1916096Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.1916477Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.1916838Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.1917204Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.1917534Z NUMA node(s): 1 2025-05-07T20:23:12.1917832Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.1918167Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.1918550Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.1918943Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.1919300Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.1919672Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.1920047Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.1920413Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.1920993Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.1921723Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.1922482Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.1923210Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.1924390Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.1925276Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.1925659Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.1925991Z 2025-05-07T20:23:12.1926084Z + cat /proc/cpuinfo 2025-05-07T20:23:12.1926220Z 2025-05-07T20:23:12.1926313Z processor : 0 2025-05-07T20:23:12.1926529Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1926772Z cpu family : 23 2025-05-07T20:23:12.1926984Z model : 49 
2025-05-07T20:23:12.1927189Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1927437Z stepping : 0 2025-05-07T20:23:12.1927652Z microcode : 0x830107f 2025-05-07T20:23:12.1927977Z cpu MHz : 3290.947 2025-05-07T20:23:12.1928190Z cache size : 512 KB 2025-05-07T20:23:12.1928402Z physical id : 0 2025-05-07T20:23:12.1928602Z siblings : 16 2025-05-07T20:23:12.1928801Z core id : 0 2025-05-07T20:23:12.1928995Z cpu cores : 8 2025-05-07T20:23:12.1929187Z apicid : 0 2025-05-07T20:23:12.1929388Z initial apicid : 0 2025-05-07T20:23:12.1929595Z fpu : yes 2025-05-07T20:23:12.1929790Z fpu_exception : yes 2025-05-07T20:23:12.1930008Z cpuid level : 13 2025-05-07T20:23:12.1930214Z wp : yes 2025-05-07T20:23:12.1932450Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1934907Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1935419Z bogomips : 5600.00 2025-05-07T20:23:12.1935647Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1935882Z clflush size : 64 2025-05-07T20:23:12.1936091Z cache_alignment : 64 2025-05-07T20:23:12.1936361Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1936695Z power management: 2025-05-07T20:23:12.1936826Z 2025-05-07T20:23:12.1936913Z processor : 1 2025-05-07T20:23:12.1937132Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1937371Z cpu family : 23 2025-05-07T20:23:12.1937569Z model : 49 2025-05-07T20:23:12.1937774Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1938023Z stepping : 0 2025-05-07T20:23:12.1938222Z microcode : 0x830107f 2025-05-07T20:23:12.1938451Z cpu MHz : 2829.392 2025-05-07T20:23:12.1938663Z cache size : 512 KB 2025-05-07T20:23:12.1938874Z physical id : 0 2025-05-07T20:23:12.1939082Z siblings : 16 2025-05-07T20:23:12.1939280Z core id : 1 2025-05-07T20:23:12.1939469Z cpu cores : 8 2025-05-07T20:23:12.1939667Z apicid : 2 2025-05-07T20:23:12.1939859Z initial apicid : 2 2025-05-07T20:23:12.1940073Z fpu : yes 2025-05-07T20:23:12.1940264Z fpu_exception : yes 2025-05-07T20:23:12.1940479Z cpuid level : 13 2025-05-07T20:23:12.1940684Z wp : yes 2025-05-07T20:23:12.1942820Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1945256Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1945770Z bogomips : 5600.00 2025-05-07T20:23:12.1945987Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1946216Z clflush size : 64 
2025-05-07T20:23:12.1946428Z cache_alignment : 64 2025-05-07T20:23:12.1946694Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1947012Z power management: 2025-05-07T20:23:12.1947148Z 2025-05-07T20:23:12.1947232Z processor : 2 2025-05-07T20:23:12.1947443Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1947680Z cpu family : 23 2025-05-07T20:23:12.1947876Z model : 49 2025-05-07T20:23:12.1948101Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1948363Z stepping : 0 2025-05-07T20:23:12.1948561Z microcode : 0x830107f 2025-05-07T20:23:12.1948784Z cpu MHz : 3303.588 2025-05-07T20:23:12.1948996Z cache size : 512 KB 2025-05-07T20:23:12.1949200Z physical id : 0 2025-05-07T20:23:12.1949405Z siblings : 16 2025-05-07T20:23:12.1949688Z core id : 2 2025-05-07T20:23:12.1949883Z cpu cores : 8 2025-05-07T20:23:12.1950223Z apicid : 4 2025-05-07T20:23:12.1950416Z initial apicid : 4 2025-05-07T20:23:12.1950626Z fpu : yes 2025-05-07T20:23:12.1950817Z fpu_exception : yes 2025-05-07T20:23:12.1951031Z cpuid level : 13 2025-05-07T20:23:12.1951237Z wp : yes 2025-05-07T20:23:12.1953483Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1955916Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1956426Z bogomips : 5600.00 2025-05-07T20:23:12.1956646Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1956886Z clflush size : 64 2025-05-07T20:23:12.1957094Z cache_alignment : 64 2025-05-07T20:23:12.1957379Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1957708Z power management: 2025-05-07T20:23:12.1957839Z 2025-05-07T20:23:12.1957920Z processor : 3 2025-05-07T20:23:12.1958136Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1958375Z cpu family : 23 2025-05-07T20:23:12.1958580Z model : 49 2025-05-07T20:23:12.1958783Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1959027Z stepping : 0 2025-05-07T20:23:12.1959233Z microcode : 0x830107f 2025-05-07T20:23:12.1959450Z cpu MHz : 3162.450 2025-05-07T20:23:12.1959663Z cache size : 512 KB 2025-05-07T20:23:12.1959880Z physical id : 0 2025-05-07T20:23:12.1960080Z siblings : 16 2025-05-07T20:23:12.1960277Z core id : 3 2025-05-07T20:23:12.1960479Z cpu cores : 8 2025-05-07T20:23:12.1960670Z apicid : 6 2025-05-07T20:23:12.1960866Z initial apicid : 6 2025-05-07T20:23:12.1961071Z fpu : yes 2025-05-07T20:23:12.1961261Z fpu_exception : yes 2025-05-07T20:23:12.1961477Z cpuid level : 13 2025-05-07T20:23:12.1961684Z wp : yes 2025-05-07T20:23:12.1963852Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1980178Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1980732Z bogomips : 5600.00 2025-05-07T20:23:12.1980960Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1981210Z clflush size : 64 2025-05-07T20:23:12.1981434Z cache_alignment : 64 2025-05-07T20:23:12.1981709Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1982042Z power management: 2025-05-07T20:23:12.1982176Z 2025-05-07T20:23:12.1982270Z processor : 4 2025-05-07T20:23:12.1982483Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1982728Z cpu family : 23 2025-05-07T20:23:12.1983253Z model : 49 2025-05-07T20:23:12.1983486Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1983876Z stepping : 0 2025-05-07T20:23:12.1984089Z microcode : 0x830107f 2025-05-07T20:23:12.1984314Z cpu MHz : 3304.110 2025-05-07T20:23:12.1984523Z cache size : 512 KB 2025-05-07T20:23:12.1984739Z physical id : 0 2025-05-07T20:23:12.1984949Z siblings : 16 2025-05-07T20:23:12.1985144Z core id : 4 2025-05-07T20:23:12.1985344Z cpu cores : 8 2025-05-07T20:23:12.1985545Z apicid : 8 2025-05-07T20:23:12.1985908Z initial apicid : 8 2025-05-07T20:23:12.1986119Z fpu : yes 2025-05-07T20:23:12.1986320Z fpu_exception : yes 2025-05-07T20:23:12.1986528Z cpuid level : 13 2025-05-07T20:23:12.1986738Z wp : yes 2025-05-07T20:23:12.1988994Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.1991562Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.1992072Z bogomips : 5600.00 2025-05-07T20:23:12.1992301Z TLB size : 3072 4K pages 2025-05-07T20:23:12.1992543Z clflush size : 64 2025-05-07T20:23:12.1992756Z cache_alignment : 64 2025-05-07T20:23:12.1993026Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.1993354Z power management: 2025-05-07T20:23:12.1993483Z 2025-05-07T20:23:12.1993577Z processor : 5 2025-05-07T20:23:12.1993781Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.1994035Z cpu family : 23 2025-05-07T20:23:12.1994323Z model : 49 2025-05-07T20:23:12.1994582Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.1994845Z stepping : 0 2025-05-07T20:23:12.1995060Z microcode : 0x830107f 2025-05-07T20:23:12.1995285Z cpu MHz : 3299.789 2025-05-07T20:23:12.1995505Z cache size : 512 KB 2025-05-07T20:23:12.1995726Z physical id : 0 2025-05-07T20:23:12.1995932Z siblings : 16 2025-05-07T20:23:12.1996147Z core id : 5 2025-05-07T20:23:12.1996426Z cpu cores : 8 2025-05-07T20:23:12.1996700Z apicid : 10 2025-05-07T20:23:12.1996995Z initial apicid : 10 2025-05-07T20:23:12.1997295Z fpu : yes 2025-05-07T20:23:12.1997582Z fpu_exception : yes 2025-05-07T20:23:12.1997888Z cpuid level : 13 2025-05-07T20:23:12.1998256Z wp : yes 2025-05-07T20:23:12.2001371Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2004200Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2004730Z bogomips : 5600.00 2025-05-07T20:23:12.2004967Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2005217Z clflush size : 64 2025-05-07T20:23:12.2005444Z cache_alignment : 64 2025-05-07T20:23:12.2005728Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2006069Z power management: 2025-05-07T20:23:12.2006209Z 2025-05-07T20:23:12.2006306Z processor : 6 2025-05-07T20:23:12.2006521Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2006773Z cpu family : 23 2025-05-07T20:23:12.2006991Z model : 49 2025-05-07T20:23:12.2007196Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2007434Z stepping : 0 2025-05-07T20:23:12.2007640Z microcode : 0x830107f 2025-05-07T20:23:12.2007870Z cpu MHz : 3164.899 2025-05-07T20:23:12.2008095Z cache size : 512 KB 2025-05-07T20:23:12.2008318Z physical id : 0 2025-05-07T20:23:12.2008527Z siblings : 16 2025-05-07T20:23:12.2008736Z core id : 6 2025-05-07T20:23:12.2008944Z cpu cores : 8 2025-05-07T20:23:12.2009150Z apicid : 12 2025-05-07T20:23:12.2009363Z initial apicid : 12 2025-05-07T20:23:12.2009585Z fpu : yes 2025-05-07T20:23:12.2009783Z fpu_exception : yes 2025-05-07T20:23:12.2010016Z cpuid level : 13 2025-05-07T20:23:12.2010356Z wp : yes 2025-05-07T20:23:12.2012581Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2015029Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2015542Z bogomips : 5600.00 2025-05-07T20:23:12.2015769Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2016012Z clflush size : 64 2025-05-07T20:23:12.2016224Z cache_alignment : 64 2025-05-07T20:23:12.2016499Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2016823Z power management: 2025-05-07T20:23:12.2016952Z 2025-05-07T20:23:12.2017032Z processor : 7 2025-05-07T20:23:12.2017248Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2017487Z cpu family : 23 2025-05-07T20:23:12.2017688Z model : 49 2025-05-07T20:23:12.2017908Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2018168Z stepping : 0 2025-05-07T20:23:12.2018378Z microcode : 0x830107f 2025-05-07T20:23:12.2018613Z cpu MHz : 3216.010 2025-05-07T20:23:12.2018847Z cache size : 512 KB 2025-05-07T20:23:12.2019060Z physical id : 0 2025-05-07T20:23:12.2019276Z siblings : 16 2025-05-07T20:23:12.2019482Z core id : 7 2025-05-07T20:23:12.2019682Z cpu cores : 8 2025-05-07T20:23:12.2019891Z apicid : 
14 2025-05-07T20:23:12.2020103Z initial apicid : 14 2025-05-07T20:23:12.2020321Z fpu : yes 2025-05-07T20:23:12.2020528Z fpu_exception : yes 2025-05-07T20:23:12.2020758Z cpuid level : 13 2025-05-07T20:23:12.2020969Z wp : yes 2025-05-07T20:23:12.2023116Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2025558Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2026069Z bogomips : 5600.00 2025-05-07T20:23:12.2026280Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2026514Z clflush size : 64 2025-05-07T20:23:12.2026726Z cache_alignment : 64 2025-05-07T20:23:12.2026989Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2027313Z power management: 2025-05-07T20:23:12.2027444Z 2025-05-07T20:23:12.2027528Z processor : 8 2025-05-07T20:23:12.2027737Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2027975Z cpu family : 23 2025-05-07T20:23:12.2028177Z model : 49 2025-05-07T20:23:12.2028379Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2028626Z stepping : 0 2025-05-07T20:23:12.2028840Z microcode : 0x830107f 2025-05-07T20:23:12.2029062Z cpu MHz : 3297.080 2025-05-07T20:23:12.2029282Z cache size : 512 KB 2025-05-07T20:23:12.2029500Z physical id : 0 2025-05-07T20:23:12.2029707Z siblings : 16 2025-05-07T20:23:12.2029903Z core id : 0 2025-05-07T20:23:12.2030254Z cpu cores : 8 2025-05-07T20:23:12.2030452Z apicid : 1 2025-05-07T20:23:12.2030653Z initial apicid : 1 2025-05-07T20:23:12.2030870Z fpu : yes 2025-05-07T20:23:12.2031064Z fpu_exception : yes 2025-05-07T20:23:12.2031288Z cpuid level : 13 2025-05-07T20:23:12.2031500Z wp : yes 2025-05-07T20:23:12.2033634Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2036281Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2036787Z bogomips : 5600.00 2025-05-07T20:23:12.2037002Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2037237Z clflush size : 64 2025-05-07T20:23:12.2037443Z cache_alignment : 64 2025-05-07T20:23:12.2037711Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2038032Z power management: 2025-05-07T20:23:12.2038162Z 2025-05-07T20:23:12.2038248Z processor : 9 2025-05-07T20:23:12.2038464Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2038708Z cpu family : 23 2025-05-07T20:23:12.2038914Z model : 49 2025-05-07T20:23:12.2039124Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2039374Z 
stepping : 0 2025-05-07T20:23:12.2039579Z microcode : 0x830107f 2025-05-07T20:23:12.2039809Z cpu MHz : 3208.471 2025-05-07T20:23:12.2040027Z cache size : 512 KB 2025-05-07T20:23:12.2040300Z physical id : 0 2025-05-07T20:23:12.2040513Z siblings : 16 2025-05-07T20:23:12.2040719Z core id : 1 2025-05-07T20:23:12.2040930Z cpu cores : 8 2025-05-07T20:23:12.2041125Z apicid : 3 2025-05-07T20:23:12.2041328Z initial apicid : 3 2025-05-07T20:23:12.2041545Z fpu : yes 2025-05-07T20:23:12.2041739Z fpu_exception : yes 2025-05-07T20:23:12.2041960Z cpuid level : 13 2025-05-07T20:23:12.2042175Z wp : yes 2025-05-07T20:23:12.2044319Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2046776Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2047296Z bogomips : 5600.00 2025-05-07T20:23:12.2047541Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2047874Z clflush size : 64 2025-05-07T20:23:12.2048177Z cache_alignment : 64 2025-05-07T20:23:12.2048559Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2048999Z power management: 2025-05-07T20:23:12.2049180Z 2025-05-07T20:23:12.2049278Z processor : 10 2025-05-07T20:23:12.2049498Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2049746Z cpu family : 23 2025-05-07T20:23:12.2049950Z model : 49 2025-05-07T20:23:12.2050160Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2050412Z stepping : 0 2025-05-07T20:23:12.2050615Z microcode : 0x830107f 2025-05-07T20:23:12.2050842Z cpu MHz : 3306.026 2025-05-07T20:23:12.2051060Z cache size : 512 KB 2025-05-07T20:23:12.2051277Z physical id : 0 2025-05-07T20:23:12.2051489Z siblings : 16 2025-05-07T20:23:12.2051695Z core id : 2 2025-05-07T20:23:12.2051890Z cpu cores : 8 2025-05-07T20:23:12.2052099Z apicid : 5 2025-05-07T20:23:12.2052384Z initial apicid : 5 2025-05-07T20:23:12.2052679Z fpu : yes 2025-05-07T20:23:12.2052946Z fpu_exception : yes 2025-05-07T20:23:12.2053239Z cpuid level : 13 2025-05-07T20:23:12.2053513Z wp : yes 2025-05-07T20:23:12.2055704Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2058709Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2059288Z bogomips : 5600.00 2025-05-07T20:23:12.2059607Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2059834Z clflush size : 64 2025-05-07T20:23:12.2060042Z cache_alignment : 64 2025-05-07T20:23:12.2060311Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.2060622Z power management: 2025-05-07T20:23:12.2060757Z 2025-05-07T20:23:12.2060838Z processor : 11 2025-05-07T20:23:12.2061052Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2061279Z cpu family : 23 2025-05-07T20:23:12.2061483Z model : 49 2025-05-07T20:23:12.2061696Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2061925Z stepping : 0 2025-05-07T20:23:12.2062120Z microcode : 0x830107f 2025-05-07T20:23:12.2062342Z cpu MHz : 3020.382 2025-05-07T20:23:12.2062544Z cache size : 512 KB 2025-05-07T20:23:12.2062753Z physical id : 0 2025-05-07T20:23:12.2063037Z siblings : 16 2025-05-07T20:23:12.2063301Z core id : 3 2025-05-07T20:23:12.2063568Z cpu cores : 8 2025-05-07T20:23:12.2063834Z apicid : 7 2025-05-07T20:23:12.2064072Z initial apicid : 7 2025-05-07T20:23:12.2064292Z fpu : yes 2025-05-07T20:23:12.2064492Z fpu_exception : yes 2025-05-07T20:23:12.2064735Z cpuid level : 13 2025-05-07T20:23:12.2065017Z wp : yes 2025-05-07T20:23:12.2067426Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2069873Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2070558Z bogomips : 5600.00 2025-05-07T20:23:12.2070766Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2071001Z clflush size : 64 2025-05-07T20:23:12.2071218Z cache_alignment : 64 2025-05-07T20:23:12.2071482Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2071804Z power management: 2025-05-07T20:23:12.2071933Z 2025-05-07T20:23:12.2072020Z processor : 12 2025-05-07T20:23:12.2072229Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2072466Z cpu family : 23 2025-05-07T20:23:12.2072668Z model : 49 2025-05-07T20:23:12.2072866Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2073105Z stepping : 0 2025-05-07T20:23:12.2073306Z microcode : 0x830107f 2025-05-07T20:23:12.2073523Z cpu MHz : 3310.035 2025-05-07T20:23:12.2073726Z cache size : 512 KB 2025-05-07T20:23:12.2073936Z physical id : 0 2025-05-07T20:23:12.2074135Z siblings : 16 2025-05-07T20:23:12.2074326Z core id : 4 2025-05-07T20:23:12.2074519Z cpu cores : 8 2025-05-07T20:23:12.2074720Z apicid : 9 2025-05-07T20:23:12.2074908Z initial apicid : 9 2025-05-07T20:23:12.2075171Z fpu : yes 2025-05-07T20:23:12.2075442Z fpu_exception : yes 2025-05-07T20:23:12.2075734Z cpuid level : 13 2025-05-07T20:23:12.2076014Z wp : yes 2025-05-07T20:23:12.2078553Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.2081121Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2081626Z bogomips : 5600.00 2025-05-07T20:23:12.2081841Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2082077Z clflush size : 64 2025-05-07T20:23:12.2082284Z cache_alignment : 64 2025-05-07T20:23:12.2082646Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2083223Z power management: 2025-05-07T20:23:12.2083356Z 2025-05-07T20:23:12.2083442Z processor : 13 2025-05-07T20:23:12.2083648Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2083884Z cpu family : 23 2025-05-07T20:23:12.2084088Z model : 49 2025-05-07T20:23:12.2084283Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2084526Z stepping : 0 2025-05-07T20:23:12.2084729Z microcode : 0x830107f 2025-05-07T20:23:12.2084949Z cpu MHz : 3299.717 2025-05-07T20:23:12.2085155Z cache size : 512 KB 2025-05-07T20:23:12.2085367Z physical id : 0 2025-05-07T20:23:12.2085563Z siblings : 16 2025-05-07T20:23:12.2085758Z core id : 5 2025-05-07T20:23:12.2085957Z cpu cores : 8 2025-05-07T20:23:12.2086145Z apicid : 11 2025-05-07T20:23:12.2086340Z initial apicid : 11 2025-05-07T20:23:12.2086544Z fpu : yes 2025-05-07T20:23:12.2086732Z fpu_exception : yes 2025-05-07T20:23:12.2086941Z cpuid level : 13 2025-05-07T20:23:12.2087141Z wp : yes 2025-05-07T20:23:12.2089622Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2092455Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2092969Z bogomips : 5600.00 2025-05-07T20:23:12.2093192Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2093432Z clflush size : 64 2025-05-07T20:23:12.2093644Z cache_alignment : 64 2025-05-07T20:23:12.2093918Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2094244Z power management: 2025-05-07T20:23:12.2094375Z 2025-05-07T20:23:12.2094461Z processor : 14 2025-05-07T20:23:12.2094677Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2094916Z cpu family : 23 2025-05-07T20:23:12.2095116Z model : 49 2025-05-07T20:23:12.2095322Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2095564Z stepping : 0 2025-05-07T20:23:12.2095767Z microcode : 0x830107f 2025-05-07T20:23:12.2095993Z cpu MHz : 3310.359 2025-05-07T20:23:12.2096214Z cache size : 512 KB 2025-05-07T20:23:12.2096427Z physical id : 0 2025-05-07T20:23:12.2096634Z siblings : 16 2025-05-07T20:23:12.2096831Z core id : 6 2025-05-07T20:23:12.2097028Z cpu cores : 8 2025-05-07T20:23:12.2097226Z apicid : 13 2025-05-07T20:23:12.2097432Z initial apicid : 13 2025-05-07T20:23:12.2097638Z fpu : yes 2025-05-07T20:23:12.2097841Z fpu_exception : yes 2025-05-07T20:23:12.2098057Z cpuid level : 13 2025-05-07T20:23:12.2098276Z wp : yes 2025-05-07T20:23:12.2100431Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2103677Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2104329Z bogomips : 5600.00 2025-05-07T20:23:12.2104547Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2104777Z clflush size : 64 2025-05-07T20:23:12.2104988Z cache_alignment : 64 2025-05-07T20:23:12.2105248Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2105562Z power management: 2025-05-07T20:23:12.2105691Z 2025-05-07T20:23:12.2105927Z processor : 15 2025-05-07T20:23:12.2106144Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.2106379Z cpu family : 23 2025-05-07T20:23:12.2106574Z model : 49 2025-05-07T20:23:12.2106774Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.2107017Z stepping : 0 2025-05-07T20:23:12.2107217Z microcode : 0x830107f 2025-05-07T20:23:12.2107439Z cpu MHz : 2912.833 2025-05-07T20:23:12.2107650Z cache size : 512 KB 2025-05-07T20:23:12.2107856Z physical id : 0 2025-05-07T20:23:12.2108069Z siblings : 16 2025-05-07T20:23:12.2108265Z core id : 7 2025-05-07T20:23:12.2108451Z cpu cores : 8 2025-05-07T20:23:12.2108650Z apicid : 15 2025-05-07T20:23:12.2108849Z initial apicid : 15 2025-05-07T20:23:12.2109058Z fpu : yes 2025-05-07T20:23:12.2109256Z fpu_exception : yes 2025-05-07T20:23:12.2109464Z cpuid level : 13 2025-05-07T20:23:12.2109662Z wp : yes 2025-05-07T20:23:12.2111917Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.2114368Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.2114878Z bogomips : 5600.00 2025-05-07T20:23:12.2115095Z TLB size : 3072 4K pages 2025-05-07T20:23:12.2115326Z clflush size : 64 2025-05-07T20:23:12.2115540Z cache_alignment : 64 2025-05-07T20:23:12.2115809Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.2116124Z power management: 2025-05-07T20:23:12.2116257Z 2025-05-07T20:23:12.2116262Z 2025-05-07T20:23:12.2116380Z ################################################################################ 2025-05-07T20:23:12.2116699Z [INFO] Print PCI info ... 2025-05-07T20:23:12.2116936Z + lspci -v 2025-05-07T20:23:12.2117054Z 2025-05-07T20:23:12.2117274Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.2117674Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.2118004Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.2118219Z 2025-05-07T20:23:12.2118429Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.2118842Z Physical Slot: 1 2025-05-07T20:23:12.2119181Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2119475Z 2025-05-07T20:23:12.2119836Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.2120422Z Physical Slot: 1 2025-05-07T20:23:12.2120683Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.2120922Z 2025-05-07T20:23:12.2121198Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.2121665Z Physical Slot: 3 2025-05-07T20:23:12.2121902Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2122249Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.2122612Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.2122843Z 2025-05-07T20:23:12.2123152Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.2123822Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.2124114Z Physical Slot: 4 2025-05-07T20:23:12.2124373Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.2124756Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2125115Z Capabilities: 2025-05-07T20:23:12.2125387Z Kernel driver in use: nvme 2025-05-07T20:23:12.2125554Z 2025-05-07T20:23:12.2125858Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.2126353Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.2126708Z Physical Slot: 5 2025-05-07T20:23:12.2126942Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2127301Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2127689Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.2128023Z Capabilities: 2025-05-07T20:23:12.2128290Z Kernel driver in use: ena 2025-05-07T20:23:12.2128532Z Kernel modules: ena 2025-05-07T20:23:12.2128670Z 2025-05-07T20:23:12.2128847Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.2129228Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.2129524Z Physical Slot: 30 2025-05-07T20:23:12.2129779Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.2130160Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.2130561Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.2130941Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.2131278Z Capabilities: 2025-05-07T20:23:12.2131539Z Kernel driver in use: nvidia 2025-05-07T20:23:12.2131796Z Kernel modules: nvidia 2025-05-07T20:23:12.2131941Z 2025-05-07T20:23:12.2132265Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.2140179Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.2140503Z Physical Slot: 31 2025-05-07T20:23:12.2140756Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.2141134Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.2141536Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.2141873Z Capabilities: 2025-05-07T20:23:12.2142145Z Kernel driver in use: nvme 2025-05-07T20:23:12.2142312Z 2025-05-07T20:23:12.2142317Z 2025-05-07T20:23:12.2142447Z ################################################################################ 2025-05-07T20:23:12.2142777Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.2143071Z + uname -a 2025-05-07T20:23:12.2143192Z 2025-05-07T20:23:12.2143629Z Linux ip-10-0-51-101.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.2144164Z 2025-05-07T20:23:12.2144253Z + uname -m 2025-05-07T20:23:12.2144367Z 2025-05-07T20:23:12.2144441Z x86_64 2025-05-07T20:23:12.2144554Z 2025-05-07T20:23:12.2144637Z + cat /proc/version 2025-05-07T20:23:12.2144770Z 2025-05-07T20:23:12.2145350Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.2146022Z 2025-05-07T20:23:12.2146109Z + cat /etc/os-release 2025-05-07T20:23:12.2146251Z 2025-05-07T20:23:12.2146340Z NAME="Amazon Linux" 2025-05-07T20:23:12.2146557Z VERSION="2023" 2025-05-07T20:23:12.2146758Z ID="amzn" 2025-05-07T20:23:12.2146941Z ID_LIKE="fedora" 2025-05-07T20:23:12.2147145Z VERSION_ID="2023" 2025-05-07T20:23:12.2147370Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.2147646Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.2147936Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.2148183Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.2148698Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.2149138Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.2149571Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.2150137Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.2150517Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.2150758Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.2151051Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.2151205Z 2025-05-07T20:23:12.2151413Z ################################################################################ 2025-05-07T20:23:12.2151734Z # Print EC2 Instance Info 2025-05-07T20:23:12.2151965Z # 2025-05-07T20:23:12.2152177Z # [2025-05-07T20:23:12.211Z] + print_ec2_info 2025-05-07T20:23:12.2152492Z ################################################################################ 2025-05-07T20:23:12.2152718Z 2025-05-07T20:23:12.2241172Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.2363955Z instance-id: i-0efa96680de6b8d22 2025-05-07T20:23:12.2479409Z instance-type: g5.4xlarge 2025-05-07T20:23:12.2520759Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.2521128Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.2531264Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.2531630Z env: 2025-05-07T20:23:12.2531845Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.2532152Z BUILD_ENV: build_binary 2025-05-07T20:23:12.2532401Z BUILD_TARGET: genai 2025-05-07T20:23:12.2532632Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.2532863Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:12.2533124Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.2533430Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.2533762Z ##[endgroup] 2025-05-07T20:23:12.5834332Z ################################################################################ 2025-05-07T20:23:12.5834833Z [INFO] Printing general display info ... 2025-05-07T20:23:12.5866822Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:12.6976475Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:12.6986642Z /usr/bin/sudo 2025-05-07T20:23:12.6996963Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:12.7007811Z /usr/bin/yum 2025-05-07T20:23:12.7009436Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:12.7029473Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.1537045Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.2258894Z ================================================================================ 2025-05-07T20:23:13.2259405Z WARNING: 2025-05-07T20:23:13.2259692Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.2259936Z 2025-05-07T20:23:13.2260030Z Available Versions: 2025-05-07T20:23:13.2260189Z 2025-05-07T20:23:13.2260279Z Version 2023.7.20250331: 2025-05-07T20:23:13.2260604Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.2260894Z 2025-05-07T20:23:13.2261026Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.2261245Z 2025-05-07T20:23:13.2261335Z Release notes: 2025-05-07T20:23:13.2261748Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.2262146Z 2025-05-07T20:23:13.2262232Z Version 2023.7.20250414: 2025-05-07T20:23:13.2262544Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.2262800Z 2025-05-07T20:23:13.2262917Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.2263131Z 2025-05-07T20:23:13.2263213Z Release notes: 2025-05-07T20:23:13.2263616Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.2263997Z 2025-05-07T20:23:13.2264088Z Version 2023.7.20250428: 2025-05-07T20:23:13.2264392Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.2264878Z 2025-05-07T20:23:13.2264988Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.2265216Z 2025-05-07T20:23:13.2265299Z Release notes: 2025-05-07T20:23:13.2265697Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.2266077Z 2025-05-07T20:23:13.2266188Z ================================================================================ 2025-05-07T20:23:13.3419238Z Dependencies resolved. 
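The "[EXEC] [ATTEMPT 0/3]" prefixes above mark commands run through a retry wrapper so that transient network or repository failures do not immediately fail the job. A minimal sketch of such a wrapper, assuming a fixed retry budget and delay; the name exec_with_retries is an assumption (the log only shows the wrapper's output format), and the real helper in setup_env.bash may differ:

exec_with_retries () {
  # Retry "$@" up to max_retries times, echoing each attempt in the log's format.
  local max_retries=3 attempt=0
  while true; do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
    "$@" && return 0
    attempt=$((attempt + 1))
    if [ "${attempt}" -gt "${max_retries}" ]; then
      echo "[EXEC] Command failed after ${max_retries} retries: $*" >&2
      return 1
    fi
    sleep 5
  done
}

exec_with_retries sudo yum update -y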
2025-05-07T20:23:13.3703884Z ================================================================================ 2025-05-07T20:23:13.3704322Z Package Arch Version Repository Size 2025-05-07T20:23:13.3704743Z ================================================================================ 2025-05-07T20:23:13.3705049Z Upgrading: 2025-05-07T20:23:13.3705416Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.3706024Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.3706418Z 2025-05-07T20:23:13.3706799Z Transaction Summary 2025-05-07T20:23:13.3707059Z ================================================================================ 2025-05-07T20:23:13.3707372Z Upgrade 2 Packages 2025-05-07T20:23:13.3707513Z 2025-05-07T20:23:13.3707614Z Total download size: 6.9 M 2025-05-07T20:23:13.3708423Z Downloading Packages: 2025-05-07T20:23:13.4224423Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 25 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.4476703Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 75 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.4486129Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.4489022Z Total 89 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.4491520Z Running transaction check 2025-05-07T20:23:13.4586810Z Transaction check succeeded. 2025-05-07T20:23:13.4587283Z Running transaction test 2025-05-07T20:23:13.4881208Z Transaction test succeeded. 2025-05-07T20:23:13.4884174Z Running transaction 2025-05-07T20:23:14.0427946Z Preparing : 1/1 2025-05-07T20:23:14.1485830Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.1505919Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.1707320Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.1708129Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.1810031Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.1832226Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.3253198Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.3254397Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.3255558Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.3256666Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:14.5255534Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.5255911Z 2025-05-07T20:23:14.5256000Z Upgraded: 2025-05-07T20:23:14.5256367Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.5256966Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.5257338Z 2025-05-07T20:23:14.5257423Z Complete! 2025-05-07T20:23:14.5696091Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:14.5718270Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.0169786Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.0409991Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.0812665Z Dependencies resolved.
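The "[INSTALL]" steps above first probe for a package manager and then install through whichever is found; the earlier "which: no apt-get" line shows the probe order on this Amazon Linux runner. A sketch of that detection under the assumption that only apt-get and yum are supported; install_system_packages is a hypothetical name for illustration:

install_system_packages () {
  # Probe for apt-get first, then fall back to yum, as the log's probe order suggests.
  echo "[INSTALL] Installing system package(s): $* ..."
  if which apt-get > /dev/null 2>&1; then
    sudo apt-get update && sudo apt-get install -y "$@"
  elif which yum > /dev/null 2>&1; then
    sudo yum install -y "$@"
  else
    echo "[INSTALL] Neither apt-get nor yum was found" >&2
    return 1
  fi
}

install_system_packages hostname lshw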
2025-05-07T20:23:15.0990964Z ================================================================================ 2025-05-07T20:23:15.0991459Z Package Architecture Version Repository Size 2025-05-07T20:23:15.0991995Z ================================================================================ 2025-05-07T20:23:15.0992352Z Installing: 2025-05-07T20:23:15.0992647Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.0992927Z 2025-05-07T20:23:15.0993022Z Transaction Summary 2025-05-07T20:23:15.0993261Z ================================================================================ 2025-05-07T20:23:15.0993568Z Install 1 Package 2025-05-07T20:23:15.0993698Z 2025-05-07T20:23:15.0993818Z Total download size: 319 k 2025-05-07T20:23:15.0994474Z Installed size: 837 k 2025-05-07T20:23:15.0995684Z Downloading Packages: 2025-05-07T20:23:15.1870608Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.5 MB/s | 319 kB 00:00 2025-05-07T20:23:15.1876196Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.1879032Z Total 3.5 MB/s | 319 kB 00:00 2025-05-07T20:23:15.2035122Z Running transaction check 2025-05-07T20:23:15.2090508Z Transaction check succeeded. 2025-05-07T20:23:15.2090938Z Running transaction test 2025-05-07T20:23:15.2544255Z Transaction test succeeded. 2025-05-07T20:23:15.2547806Z Running transaction 2025-05-07T20:23:15.3544363Z Preparing : 1/1 2025-05-07T20:23:15.4032752Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.5561550Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.7115570Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:15.7115925Z 2025-05-07T20:23:15.7116013Z Installed: 2025-05-07T20:23:15.7116323Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:15.7116634Z 2025-05-07T20:23:15.7116713Z Complete! 2025-05-07T20:23:15.7558933Z + hostname 2025-05-07T20:23:15.7559138Z 2025-05-07T20:23:15.7572871Z ip-10-0-51-101.ec2.internal 2025-05-07T20:23:15.7574268Z 2025-05-07T20:23:15.7574683Z + sudo lshw -C display 2025-05-07T20:23:15.7574923Z 2025-05-07T20:23:16.2697258Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.2697676Z description: VGA compatible controller 2025-05-07T20:23:16.2698011Z product: Amazon.com, Inc. 2025-05-07T20:23:16.2698282Z vendor: Amazon.com, Inc.
2025-05-07T20:23:16.2698544Z physical id: 3 2025-05-07T20:23:16.2698783Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.2699065Z version: 00 2025-05-07T20:23:16.2699297Z width: 32 bits 2025-05-07T20:23:16.2699526Z clock: 33MHz 2025-05-07T20:23:16.2699771Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.2700081Z configuration: latency=0 2025-05-07T20:23:16.2700412Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.2700755Z *-display:1 2025-05-07T20:23:16.2700965Z description: 3D controller 2025-05-07T20:23:16.2701275Z product: GA102GL [A10G] 2025-05-07T20:23:16.2701540Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.2701797Z physical id: 1e 2025-05-07T20:23:16.2702030Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.2702284Z version: a1 2025-05-07T20:23:16.2702483Z width: 64 bits 2025-05-07T20:23:16.2702702Z clock: 33MHz 2025-05-07T20:23:16.2702990Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.2703364Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.2704014Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.2736727Z 2025-05-07T20:23:16.2737087Z ################################################################################ 2025-05-07T20:23:16.2737447Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.2864342Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.3034182Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.3034650Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.3035181Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.3035690Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.3036206Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.3036756Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.3037206Z | | | MIG M. | 2025-05-07T20:23:16.3037549Z |=========================================+========================+======================| 2025-05-07T20:23:16.3116189Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.3118129Z | 0% 30C P0 60W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.3118567Z | | | N/A | 2025-05-07T20:23:16.3118974Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.3119383Z 2025-05-07T20:23:16.3119787Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.3120234Z | Processes: | 2025-05-07T20:23:16.3120688Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.3121116Z | ID ID Usage | 2025-05-07T20:23:16.3121478Z |=========================================================================================| 2025-05-07T20:23:16.3121913Z | No running processes found | 2025-05-07T20:23:16.3122402Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.4565360Z ################################################################################ 2025-05-07T20:23:16.4565732Z [INFO] Printing AMD GPU info ... 
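These GPU stanzas come from the print_gpu_info helper named in the step header: it prints NVIDIA state via lspci and nvidia-smi and then looks for the ROCm tools, and since this job exports ENFORCE_CUDA_DEVICE=1, a missing CUDA device is presumably treated as fatal. A sketch of the probe under those assumptions; the real logic in setup_env.bash may differ:

print_gpu_info () {
  echo "[INFO] Printing NVIDIA GPU info ..."
  if which nvidia-smi > /dev/null 2>&1; then
    nvidia-smi
  elif [ "${ENFORCE_CUDA_DEVICE:-0}" = "1" ]; then
    # Assumed behavior: fail the step when a CUDA device is required but absent.
    echo "[CHECK] nvidia-smi not found, but ENFORCE_CUDA_DEVICE=1" >&2
    return 1
  fi
  echo "[INFO] Printing AMD GPU info ..."
  for tool in rocminfo rocm-smi; do
    if which "${tool}" > /dev/null 2>&1; then
      "${tool}"
    else
      echo "[CHECK] ${tool} not found"
    fi
  done
}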
2025-05-07T20:23:16.4565360Z ################################################################################
2025-05-07T20:23:16.4565732Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:16.4707883Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:16.4708895Z [CHECK] rocminfo not found
2025-05-07T20:23:16.4718003Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:16.4719153Z [CHECK] rocm-smi not found
2025-05-07T20:23:16.4780691Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:16.4781145Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:16.4793171Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:16.4793533Z env:
2025-05-07T20:23:16.4793755Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:16.4794055Z BUILD_ENV: build_binary
2025-05-07T20:23:16.4794297Z BUILD_TARGET: genai
2025-05-07T20:23:16.4794527Z BUILD_VARIANT: cuda
2025-05-07T20:23:16.4794753Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:16.4795005Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:16.4795305Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:16.4795643Z ##[endgroup]
2025-05-07T20:23:16.8134557Z ################################################################################
2025-05-07T20:23:16.8134932Z # Setup Miniconda
2025-05-07T20:23:16.8135148Z #
2025-05-07T20:23:16.8149027Z # [2025-05-07T20:23:16.814Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:16.8149446Z ################################################################################
2025-05-07T20:23:16.8164618Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:16.9037705Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:16.9038075Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:16.9054565Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:16.9075716Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:17.9380649Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:17.9381156Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:17.9525527Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:18.3962745Z Unpacking payload ...
2025-05-07T20:23:18.9148462Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:19.7165956Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:21.8167482Z Installing base environment...
2025-05-07T20:23:22.9022572Z Preparing transaction: ...working... done
2025-05-07T20:23:25.8392099Z Executing transaction: ...working... done
2025-05-07T20:23:26.4991976Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:26.5880972Z installation finished.
2025-05-07T20:23:26.5890857Z + rm -f miniconda.sh
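[NOTE] The installer steps above are reproducible outside of CI; a minimal sketch of the same non-interactive flow (prefix path assumed):
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$HOME/miniconda" -u   # -b: batch mode (accept license, no prompts); -p: install prefix; -u: update an existing install
    rm -f miniconda.sh                             # the installer is not needed after the prefix is populated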
2025-05-07T20:23:26.6201800Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:26.6202166Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:26.9838789Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:26.9839379Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:26.9839938Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:26.9840444Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:26.9840977Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:26.9841554Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:26.9842172Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:26.9842794Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:26.9843445Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:26.9844650Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:26.9845433Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:26.9845983Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:26.9846582Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.0488406Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:27.8828139Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:27.8852812Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.3557682Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:42.9419421Z Solving environment: done
2025-05-07T20:23:43.0383886Z ## Package Plan ##
2025-05-07T20:23:43.0384208Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.0384574Z added / updated specs:
2025-05-07T20:23:43.0384846Z - conda-libmamba-solver
2025-05-07T20:23:43.0385098Z - libarchive
2025-05-07T20:23:43.0385310Z - libmamba
2025-05-07T20:23:43.0385514Z - libmambapy
2025-05-07T20:23:43.0385782Z The following packages will be downloaded:
2025-05-07T20:23:43.0386125Z package | build
2025-05-07T20:23:43.0386456Z ---------------------------|-----------------
2025-05-07T20:23:43.0386898Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:23:43.0387404Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:23:43.0387862Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:23:43.0388365Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:23:43.0388828Z ------------------------------------------------------------
2025-05-07T20:23:43.0389184Z Total: 1.4 MB
2025-05-07T20:23:43.0389517Z The following packages will be UPDATED:
2025-05-07T20:23:43.0395729Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.0396565Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.0397199Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.0397871Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.0398719Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.0399400Z Downloading and Extracting Packages: ...working... done
[download progress elided: conda-25.3.1, certifi-2025.4.26, ca-certificates-2025.4.26, and conda-libmamba-solver-25.4.0 all reached 100%]
2025-05-07T20:23:43.4192039Z Preparing transaction: done
2025-05-07T20:23:43.5194425Z Verifying transaction: done
2025-05-07T20:23:44.8212909Z Executing transaction: done
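[NOTE] Each [EXEC] [ATTEMPT n/3] line comes from a retry wrapper defined in setup_env.bash. The wrapper itself is not shown in this log; a minimal sketch of the pattern (the helper name and backoff interval are assumptions):
    run_with_retries () {
      local max=3
      for ((i = 0; i <= max; i++)); do
        echo "[EXEC] [ATTEMPT ${i}/${max}] + $*"
        "$@" && return 0   # success: stop retrying
        sleep 5            # assumed backoff between attempts
      done
      return 1             # all attempts failed
    }
    run_with_retries conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive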
2025-05-07T20:23:46.5878273Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:46.5908121Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.5123049Z Channels:
2025-05-07T20:23:47.5123331Z - defaults
2025-05-07T20:23:47.5123537Z Platform: linux-64
2025-05-07T20:23:48.7650537Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.3908901Z Solving environment: done
2025-05-07T20:23:49.5397356Z ## Package Plan ##
2025-05-07T20:23:49.5397682Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.5398030Z added / updated specs:
2025-05-07T20:23:49.5398294Z - conda
2025-05-07T20:23:49.5398543Z The following packages will be downloaded:
2025-05-07T20:23:49.5398880Z package | build
2025-05-07T20:23:49.5399212Z ---------------------------|-----------------
2025-05-07T20:23:49.5399576Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:23:49.5399977Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:23:49.5400357Z ------------------------------------------------------------
2025-05-07T20:23:49.5400703Z Total: 1.4 MB
2025-05-07T20:23:49.5401391Z The following packages will be UPDATED:
2025-05-07T20:23:49.5401925Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.5402456Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.5402874Z Downloading and Extracting Packages: ...working... done
[download progress elided: pip-25.1 and tzdata-2025b both reached 100%]
2025-05-07T20:23:49.9071101Z Preparing transaction: done
2025-05-07T20:23:50.0076825Z Verifying transaction: done
2025-05-07T20:23:52.3114297Z Executing transaction: done
2025-05-07T20:23:52.9227713Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:52.9232204Z + conda clean --packages --tarball -y
2025-05-07T20:23:53.9306751Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:53.9307099Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:53.9952990Z + conda clean --all -y
2025-05-07T20:23:54.5281322Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.5281721Z Will remove 1 index cache(s).
2025-05-07T20:23:54.5282007Z There are no unused package(s) to remove.
2025-05-07T20:23:54.5282330Z There are no tempfile(s) to remove.
2025-05-07T20:23:54.5282625Z There are no logfile(s) to remove.
2025-05-07T20:23:54.5921961Z + conda info
2025-05-07T20:23:55.3595035Z active environment : base
2025-05-07T20:23:55.3595507Z active env location : /home/ec2-user/miniconda
2025-05-07T20:23:55.3595934Z shell level : 1
2025-05-07T20:23:55.3596316Z user config file : /home/ec2-user/.condarc
2025-05-07T20:23:55.3596797Z populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:23:55.3597188Z conda version : 25.3.1
2025-05-07T20:23:55.3597504Z conda-build version : not installed
2025-05-07T20:23:55.3597806Z python version : 3.13.2.final.0
2025-05-07T20:23:55.3598118Z solver : libmamba (default)
2025-05-07T20:23:55.3598449Z virtual packages : __archspec=1=zen2
2025-05-07T20:23:55.3598757Z __conda=25.3.1=0
2025-05-07T20:23:55.3599041Z __cuda=12.8=0
2025-05-07T20:23:55.3599325Z __glibc=2.34=0
2025-05-07T20:23:55.3599611Z __linux=6.1.130=0
2025-05-07T20:23:55.3599887Z __unix=0=0
2025-05-07T20:23:55.3600233Z base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:23:55.3600658Z conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:23:55.3601020Z conda av metadata url : None
2025-05-07T20:23:55.3601397Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:23:55.3602994Z https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:23:55.3603400Z https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:23:55.3603780Z https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:23:55.3604160Z package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:23:55.3604506Z /home/ec2-user/.conda/pkgs
2025-05-07T20:23:55.3604846Z envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:23:55.3605191Z /home/ec2-user/.conda/envs
2025-05-07T20:23:55.3605498Z platform : linux-64
2025-05-07T20:23:55.3606385Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:23:55.3607264Z UID:GID : 1000:1000
2025-05-07T20:23:55.3607537Z netrc file : None
2025-05-07T20:23:55.3607806Z offline mode : False
2025-05-07T20:23:55.4258967Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:23:55.4260096Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_53de9a79-c4b6-4b66-9cfe-ac216a3e2536 ...
2025-05-07T20:23:55.4261578Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
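[NOTE] The add_path_* runner file command above is how setup_miniconda publishes the conda binaries to later steps: paths appended to the file named by $GITHUB_PATH are prepended to PATH for all subsequent steps of the job. A sketch of the equivalent one-liner (the exact contents the script writes are assumed):
    echo "$HOME/miniconda/bin" >> "$GITHUB_PATH"   # takes effect in later steps, not in the current one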
2025-05-07T20:23:55.4338043Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.9
2025-05-07T20:23:55.4338543Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.9
2025-05-07T20:23:55.4357745Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:55.4358103Z env:
2025-05-07T20:23:55.4358319Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:55.4358625Z BUILD_ENV: build_binary
2025-05-07T20:23:55.4358869Z BUILD_TARGET: genai
2025-05-07T20:23:55.4359094Z BUILD_VARIANT: cuda
2025-05-07T20:23:55.4359318Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:55.4359571Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:55.4359869Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:55.4360197Z ##[endgroup]
2025-05-07T20:23:55.7692407Z ################################################################################
2025-05-07T20:23:55.7692791Z # Create Conda Environment
2025-05-07T20:23:55.7693036Z #
2025-05-07T20:23:55.7709504Z # [2025-05-07T20:23:55.770Z] + create_conda_environment build_binary 3.9
2025-05-07T20:23:55.7710047Z ################################################################################
2025-05-07T20:23:55.7727882Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:55.8622947Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:55.8623760Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:23:55.8624408Z + conda info --envs
2025-05-07T20:23:56.6079728Z # conda environments:
2025-05-07T20:23:56.6079994Z #
2025-05-07T20:23:56.6080222Z base /home/ec2-user/miniconda
2025-05-07T20:23:56.6735813Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:23:58.3090274Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:23:58.3112814Z [SETUP] Creating new Conda environment (Python 3.9) ...
2025-05-07T20:23:58.3135521Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.9
2025-05-07T20:23:59.0633712Z Channels:
2025-05-07T20:23:59.0634057Z - defaults
2025-05-07T20:23:59.0634336Z Platform: linux-64
2025-05-07T20:24:00.6098158Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:00.7101938Z Solving environment: done
2025-05-07T20:24:00.7387946Z ## Package Plan ##
2025-05-07T20:24:00.7388581Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:00.7389145Z added / updated specs:
2025-05-07T20:24:00.7389502Z - python=3.9
2025-05-07T20:24:00.7390060Z The following packages will be downloaded:
2025-05-07T20:24:00.7390510Z package | build
2025-05-07T20:24:00.7390855Z ---------------------------|-----------------
2025-05-07T20:24:00.7391228Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:00.7391643Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:00.7392071Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:00.7392503Z python-3.9.21 | he870216_1 25.1 MB
2025-05-07T20:24:00.7392920Z setuptools-78.1.1 | py39h06a4308_0 1.7 MB
2025-05-07T20:24:00.7393330Z wheel-0.45.1 | py39h06a4308_0 114 KB
2025-05-07T20:24:00.7393704Z ------------------------------------------------------------
2025-05-07T20:24:00.7394049Z Total: 27.1 MB
2025-05-07T20:24:00.7394807Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:00.7395438Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:00.7395897Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:00.7396426Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:00.7396986Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:00.7397457Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:00.7397900Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:00.7398352Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:00.7398833Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:00.7399295Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:00.7399732Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:00.7400154Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:00.7400570Z python pkgs/main/linux-64::python-3.9.21-he870216_1
2025-05-07T20:24:00.7401003Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:00.7401493Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py39h06a4308_0
2025-05-07T20:24:00.7401975Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:00.7402380Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:00.7402770Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:00.7403199Z wheel pkgs/main/linux-64::wheel-0.45.1-py39h06a4308_0
2025-05-07T20:24:00.7403606Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:00.7403983Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:00.7404404Z Downloading and Extracting Packages: ...working... done
[download progress elided: python-3.9.21 (25.1 MB), setuptools-78.1.1, wheel-0.45.1, ca-certificates-2025.2.25, _openmp_mutex-5.1, and _libgcc_mutex-0.1 all reached 100%]
2025-05-07T20:24:02.0309166Z Preparing transaction: done
2025-05-07T20:24:03.1643382Z Verifying transaction: done
2025-05-07T20:24:05.3827612Z Executing transaction: done
2025-05-07T20:24:05.4331389Z #
2025-05-07T20:24:05.4331988Z # To activate this environment, use
2025-05-07T20:24:05.4332762Z #
2025-05-07T20:24:05.4333298Z # $ conda activate build_binary
2025-05-07T20:24:05.4334008Z #
2025-05-07T20:24:05.4334421Z # To deactivate an active environment, use
2025-05-07T20:24:05.4334993Z #
2025-05-07T20:24:05.4335348Z # $ conda deactivate
2025-05-07T20:24:05.5413569Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:05.5435415Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:08.3799329Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (25.1)
2025-05-07T20:24:08.3800251Z Collecting pip
2025-05-07T20:24:08.3801206Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:08.3801827Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:08.3802971Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 115.0 MB/s eta 0:00:00
2025-05-07T20:24:08.3803494Z Installing collected packages: pip
2025-05-07T20:24:08.3803938Z Attempting uninstall: pip
2025-05-07T20:24:08.3804336Z Found existing installation: pip 25.1
2025-05-07T20:24:08.3804792Z Uninstalling pip-25.1:
2025-05-07T20:24:08.3805186Z Successfully uninstalled pip-25.1
2025-05-07T20:24:08.3805633Z Successfully installed pip-25.1.1
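[NOTE] conda run -n build_binary executes a single command inside the environment without activating it in the calling shell, which is why the prelude can upgrade pip before any conda activate has happened. The same pattern works for arbitrary in-env commands (a sketch):
    conda run -n build_binary pip install --upgrade pip
    conda run -n build_binary python -c 'import sys; print(sys.executable)'   # confirms the env's interpreter is used, not the base one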
2025-05-07T20:24:08.4438367Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:08.4460792Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:09.2981726Z Channels:
2025-05-07T20:24:09.2982167Z - conda-forge
2025-05-07T20:24:09.2982589Z Platform: linux-64
2025-05-07T20:24:19.7144754Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:21.2218952Z Solving environment: done
2025-05-07T20:24:21.2817809Z ## Package Plan ##
2025-05-07T20:24:21.2818342Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:21.2818859Z added / updated specs:
2025-05-07T20:24:21.2819142Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:21.2819475Z The following packages will be downloaded:
2025-05-07T20:24:21.2819825Z package | build
2025-05-07T20:24:21.2820183Z ---------------------------|-----------------
2025-05-07T20:24:21.2820582Z cffi-1.17.1 | py39h15c3d72_0 236 KB conda-forge
2025-05-07T20:24:21.2821075Z cryptography-44.0.3 | py39h7170ec2_0 1.5 MB conda-forge
2025-05-07T20:24:21.2821550Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:21.2821983Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:21.2822425Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:21.2822864Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:21.2823311Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:21.2823772Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:21.2824227Z python_abi-3.9 | 2_cp39 4 KB conda-forge
2025-05-07T20:24:21.2824714Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:21.2825229Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:21.2825684Z ------------------------------------------------------------
2025-05-07T20:24:21.2826047Z Total: 6.3 MB
2025-05-07T20:24:21.2826407Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:21.2826838Z cffi conda-forge/linux-64::cffi-1.17.1-py39h15c3d72_0
2025-05-07T20:24:21.2827359Z cryptography conda-forge/linux-64::cryptography-44.0.3-py39h7170ec2_0
2025-05-07T20:24:21.2827893Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:21.2828371Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:21.2828871Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:21.2829707Z python_abi conda-forge/linux-64::python_abi-3.9-2_cp39
2025-05-07T20:24:21.2830401Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:21.2831187Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:21.2831676Z The following packages will be UPDATED:
2025-05-07T20:24:21.2832484Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:21.2833297Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:21.2833987Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:21.2834661Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:21.2835268Z Downloading and Extracting Packages: ...working... done
[download progress elided: openssl-3.5.0 (3.0 MB), cryptography-44.0.3, libgcc-15.1.0, libgomp-15.1.0, cffi-1.17.1, pyopenssl-25.0.0, pycparser-2.22, typing-extensions-4.13.2, typing_extensions-4.13.2, libgcc-ng-15.1.0, and python_abi-3.9 all reached 100%]
2025-05-07T20:24:21.8759293Z Preparing transaction: done
2025-05-07T20:24:21.9763615Z Verifying transaction: done
2025-05-07T20:24:23.4791571Z Executing transaction: done
2025-05-07T20:24:23.6557607Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:25.3783443Z [CHECK] Python (sub-)package 'OpenSSL' found ...
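[NOTE] Two details of the pyOpenSSL step are easy to trip over when reproducing it. The version spec has to be quoted in a shell, since an unquoted > is parsed as output redirection; and the install is verified by importing the package, whose import name is OpenSSL rather than pyOpenSSL. A sketch of both:
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"
    conda run -n build_binary python -c 'import OpenSSL; print(OpenSSL.__version__)'   # import name differs from the package name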
2025-05-07T20:24:25.3797067Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:25.3820045Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:26.2399315Z Channels:
2025-05-07T20:24:26.2399580Z - conda-forge
2025-05-07T20:24:26.2399811Z Platform: linux-64
2025-05-07T20:24:29.6227038Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:29.9947480Z Solving environment: done
2025-05-07T20:24:30.0549568Z ## Package Plan ##
2025-05-07T20:24:30.0550104Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.0550598Z added / updated specs:
2025-05-07T20:24:30.0550844Z - libxcrypt
2025-05-07T20:24:30.0551127Z The following packages will be downloaded:
2025-05-07T20:24:30.0551469Z package | build
2025-05-07T20:24:30.0551800Z ---------------------------|-----------------
2025-05-07T20:24:30.0552184Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:30.0552601Z ------------------------------------------------------------
2025-05-07T20:24:30.0552947Z Total: 98 KB
2025-05-07T20:24:30.0553286Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:30.0553738Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:30.0554193Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:30.4248409Z Preparing transaction: done
2025-05-07T20:24:30.5252651Z Verifying transaction: done
2025-05-07T20:24:30.6258305Z Executing transaction: done
2025-05-07T20:24:34.0643844Z [SETUP] Copying over ...
2025-05-07T20:24:34.0644589Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.9/crypt.h
2025-05-07T20:24:35.7024699Z [SETUP] Installed Python version: Python 3.9.21
2025-05-07T20:24:35.7025189Z [SETUP] Successfully created Conda environment: build_binary
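[NOTE] The crypt.h copy above works around the fact that Python 3.9 headers still expect crypt.h, which recent glibc builds no longer ship; installing libxcrypt and copying its header into the env's Python include directory satisfies the include. A minimal sketch of the same fix (PREFIX is a placeholder for the env prefix):
    PREFIX="$HOME/miniconda/envs/build_binary"
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    cp "$PREFIX/include/crypt.h" "$PREFIX/include/python3.9/crypt.h"   # make crypt.h visible to Python C-extension builds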
2025-05-07T20:24:35.7057896Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:35.7058362Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:35.7070602Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:35.7070956Z env:
2025-05-07T20:24:35.7071185Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:35.7071515Z BUILD_ENV: build_binary
2025-05-07T20:24:35.7071777Z BUILD_TARGET: genai
2025-05-07T20:24:35.7072021Z BUILD_VARIANT: cuda
2025-05-07T20:24:35.7072267Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:35.7072545Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:35.7072870Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:35.7073375Z ##[endgroup]
2025-05-07T20:24:36.0452921Z ################################################################################
2025-05-07T20:24:36.0453455Z # Install C/C++ Compilers
2025-05-07T20:24:36.0453799Z #
2025-05-07T20:24:36.0469763Z # [2025-05-07T20:24:36.046Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:36.0470491Z ################################################################################
2025-05-07T20:24:36.0485769Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:36.1409318Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:36.1420425Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:36.1443867Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:37.0077667Z Channels:
2025-05-07T20:24:37.0078285Z - conda-forge
2025-05-07T20:24:37.0078631Z Platform: linux-64
2025-05-07T20:24:40.3344956Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:40.7047161Z Solving environment: done
2025-05-07T20:24:40.7659150Z ## Package Plan ##
2025-05-07T20:24:40.7659620Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:40.7660042Z added / updated specs:
2025-05-07T20:24:40.7660313Z - sysroot_linux-64=2.17
2025-05-07T20:24:40.7660614Z The following packages will be downloaded:
2025-05-07T20:24:40.7660957Z package | build
2025-05-07T20:24:40.7661280Z ---------------------------|-----------------
2025-05-07T20:24:40.7661719Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:24:40.7662228Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:24:40.7662661Z ------------------------------------------------------------
2025-05-07T20:24:40.7663016Z Total: 15.4 MB
2025-05-07T20:24:40.7663367Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:40.7663902Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:40.7664491Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:40.7664971Z Downloading and Extracting Packages: ...working... done
[download progress elided: sysroot_linux-64-2.17 (14.5 MB) and kernel-headers_linux-64-3.10.0 both reached 100%]
2025-05-07T20:24:41.7443023Z Preparing transaction: done
2025-05-07T20:24:41.9448695Z Verifying transaction: done
2025-05-07T20:24:42.1490803Z Executing transaction: done
2025-05-07T20:24:42.3014381Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:42.3014805Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:43.9875045Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
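[NOTE] Pinning sysroot_linux-64=2.17 makes the toolchain link against glibc 2.17 symbols, the same floor manylinux2014 uses, so the resulting binaries stay loadable on older distributions. One way to spot-check an artifact's glibc requirement after a build (the binary path is a placeholder):
    objdump -T build/libexample.so | grep -o 'GLIBC_[0-9.]*' | sort -uV | tail -1   # highest glibc symbol version referenced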
2025-05-07T20:24:43.9888860Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:24:43.9912288Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:44.8777018Z Channels:
2025-05-07T20:24:44.8777293Z - conda-forge
2025-05-07T20:24:44.8777535Z Platform: linux-64
2025-05-07T20:24:48.1994689Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:49.1634101Z Solving environment: done
2025-05-07T20:24:49.2271885Z ## Package Plan ##
2025-05-07T20:24:49.2272267Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:49.2272900Z added / updated specs:
2025-05-07T20:24:49.2273223Z - gxx_linux-64=11.4.0
2025-05-07T20:24:49.2273600Z The following packages will be downloaded:
2025-05-07T20:24:49.2274043Z package | build
2025-05-07T20:24:49.2274386Z ---------------------------|-----------------
2025-05-07T20:24:49.2274910Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:24:49.2275572Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:24:49.2276242Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:24:49.2276775Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:24:49.2277247Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:24:49.2277716Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:24:49.2278191Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:24:49.2278685Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:24:49.2279199Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:24:49.2279673Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:24:49.2280178Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:24:49.2280803Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:24:49.2281391Z ------------------------------------------------------------
2025-05-07T20:24:49.2281882Z Total: 91.6 MB
2025-05-07T20:24:49.2282230Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:49.2283103Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:24:49.2283930Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:24:49.2284880Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:24:49.2285419Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:24:49.2285948Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:24:49.2286480Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:24:49.2287037Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:49.2287621Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:24:49.2288143Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:24:49.2288717Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:49.2289365Z The following packages will be UPDATED:
2025-05-07T20:24:49.2289914Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:24:49.2290831Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:24:49.2291421Z 2025-05-07T20:24:49.2291439Z 2025-05-07T20:24:49.2291444Z 2025-05-07T20:24:49.2291617Z Downloading and Extracting Packages: ...working... 2025-05-07T20:24:49.2292023Z gcc_impl_linux-64-11 | 53.0 MB | | 0% 2025-05-07T20:24:49.2292261Z 2025-05-07T20:24:49.2292520Z gxx_impl_linux-64-11 | 11.2 MB | | 0%  2025-05-07T20:24:49.2292774Z 2025-05-07T20:24:49.2292778Z 2025-05-07T20:24:49.2293003Z libstdcxx-devel_linu | 11.1 MB | | 0%  2025-05-07T20:24:49.2293279Z 2025-05-07T20:24:49.2293290Z 2025-05-07T20:24:49.2293294Z 2025-05-07T20:24:49.2310440Z binutils_impl_linux- | 6.0 MB | | 0%  2025-05-07T20:24:49.2310851Z 2025-05-07T20:24:49.2310857Z 2025-05-07T20:24:49.2310862Z 2025-05-07T20:24:49.2317219Z 2025-05-07T20:24:49.2330606Z libstdcxx-15.1.0 | 3.7 MB | | 0%  2025-05-07T20:24:49.2331002Z 2025-05-07T20:24:49.2331021Z 2025-05-07T20:24:49.2331026Z 2025-05-07T20:24:49.2331031Z 2025-05-07T20:24:49.2331044Z 2025-05-07T20:24:49.2331871Z libsanitizer-11.4.0 | 3.5 MB | | 0%  2025-05-07T20:24:49.2332278Z 2025-05-07T20:24:49.2332283Z 2025-05-07T20:24:49.2332289Z 2025-05-07T20:24:49.2332305Z 2025-05-07T20:24:49.2332311Z 2025-05-07T20:24:49.2332323Z 2025-05-07T20:24:49.2333462Z libgcc-devel_linux-6 | 2.3 MB | | 0%  2025-05-07T20:24:49.2333865Z 2025-05-07T20:24:49.2333871Z 2025-05-07T20:24:49.2333888Z 2025-05-07T20:24:49.2333900Z 2025-05-07T20:24:49.2333918Z 2025-05-07T20:24:49.2333923Z 2025-05-07T20:24:49.2333929Z 2025-05-07T20:24:49.2334945Z ld_impl_linux-64-2.4 | 691 KB | | 0%  2025-05-07T20:24:49.2335340Z 2025-05-07T20:24:49.2335356Z 2025-05-07T20:24:49.2335362Z 2025-05-07T20:24:49.2335376Z 2025-05-07T20:24:49.2335381Z 2025-05-07T20:24:49.2335386Z 2025-05-07T20:24:49.2335392Z 2025-05-07T20:24:49.2335397Z 2025-05-07T20:24:49.2336433Z libstdcxx-ng-15.1.0 | 34 KB | | 0%  2025-05-07T20:24:49.2336849Z 2025-05-07T20:24:49.2336866Z 2025-05-07T20:24:49.2336872Z 2025-05-07T20:24:49.2336877Z 2025-05-07T20:24:49.2336882Z 2025-05-07T20:24:49.2336887Z 2025-05-07T20:24:49.2336892Z 2025-05-07T20:24:49.2336897Z 2025-05-07T20:24:49.2336902Z 2025-05-07T20:24:49.2346870Z gcc_linux-64-11.4.0 | 31 KB | | 0%  2025-05-07T20:24:49.2347293Z 2025-05-07T20:24:49.2347298Z 2025-05-07T20:24:49.2347303Z 2025-05-07T20:24:49.2347309Z 2025-05-07T20:24:49.2347323Z 2025-05-07T20:24:49.2347329Z 2025-05-07T20:24:49.2347334Z 2025-05-07T20:24:49.2347339Z 2025-05-07T20:24:49.2347344Z 2025-05-07T20:24:49.2355324Z 2025-05-07T20:24:49.2356503Z gxx_linux-64-11.4.0 | 29 KB | | 0%  2025-05-07T20:24:49.2356919Z 2025-05-07T20:24:49.2356924Z 2025-05-07T20:24:49.2356929Z 2025-05-07T20:24:49.2356934Z 2025-05-07T20:24:49.2356940Z 2025-05-07T20:24:49.2356945Z 2025-05-07T20:24:49.2356950Z 2025-05-07T20:24:49.2356955Z 2025-05-07T20:24:49.2356967Z 2025-05-07T20:24:49.2356972Z 2025-05-07T20:24:49.2356977Z 2025-05-07T20:24:49.3279291Z binutils_linux-64-2. 
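Once the transaction below finishes, the setup script exposes the toolchain's prefixed binaries (x86_64-conda-linux-gnu-cc and -c++) under the generic names cc, gcc, c++, and g++, then confirms each resolves in PATH. A hedged sketch of that symlink-and-check pattern, with the prefix path taken from this log:

    PREFIX=/home/ec2-user/miniconda/envs/build_binary
    ln -sf "$PREFIX/bin/x86_64-conda-linux-gnu-cc"  "$PREFIX/bin/cc"
    ln -sf "$PREFIX/bin/x86_64-conda-linux-gnu-cc"  "$PREFIX/bin/gcc"
    ln -sf "$PREFIX/bin/x86_64-conda-linux-gnu-c++" "$PREFIX/bin/c++"
    ln -sf "$PREFIX/bin/x86_64-conda-linux-gnu-c++" "$PREFIX/bin/g++"
    # Each name should now resolve to a path under $PREFIX/bin
    command -v cc gcc c++ g++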
2025-05-07T20:24:49.2291617Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:51.4919965Z Preparing transaction: done
2025-05-07T20:24:51.7935910Z Verifying transaction: done
2025-05-07T20:24:51.8950949Z Executing transaction: done
2025-05-07T20:24:52.0610101Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:55.9395758Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:55.9428417Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:55.9458228Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:55.9489409Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:57.8349368Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:57.8979704Z [CHECK] Binary cc found in PATH
2025-05-07T20:24:59.7780568Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:59.8405093Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:01.7245961Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:01.7885838Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:03.6698603Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:03.7321627Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:03.7326290Z [INFO] Printing out all preprocessor defines in the C compiler ...
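The dump that follows is produced by asking the preprocessor for its predefined macros: -dM prints every #define in effect instead of the preprocessed source, -E stops after preprocessing, and the trailing - reads the (empty) translation unit from stdin. A sketch for spot-checking a single macro, assuming an empty stdin is supplied:

    conda run -n build_binary cc -dM -E - </dev/null | grep __VERSION__
    # expected from this toolchain: #define __VERSION__ "11.4.0"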
2025-05-07T20:25:03.7326857Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:03.7327099Z 2025-05-07T20:25:05.6140840Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6141311Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:05.6141702Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:05.6142053Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6142463Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:05.6142842Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:05.6143137Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:05.6143495Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:05.6143873Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:05.6144226Z #define __CHAR_BIT__ 8 2025-05-07T20:25:05.6144538Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:05.6145220Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:05.6145490Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:05.6145770Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:05.6146051Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:05.6146357Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6146659Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:05.6146957Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:05.6147292Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:05.6147633Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:05.6148054Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:05.6148494Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:05.6148979Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:05.6149268Z #define __GCC_IEC_559 2 2025-05-07T20:25:05.6149525Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6149939Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:05.6150215Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6150501Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:05.6150841Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6151170Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:05.6151448Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6151728Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:05.6152000Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:05.6152263Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:05.6152529Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:05.6152797Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:05.6153055Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:05.6153310Z #define __INT8_C(c) c 2025-05-07T20:25:05.6153556Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:05.6153854Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6154189Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:05.6154517Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.6154885Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6155168Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.6155436Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6155715Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:05.6155996Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:05.6156406Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:05.6156845Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:05.6157137Z #define __linux 1 2025-05-07T20:25:05.6157365Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:05.6157650Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:05.6157930Z #define __unix 1 2025-05-07T20:25:05.6158163Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:05.6158446Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6158720Z #define __WINT_MIN__ 0U 2025-05-07T20:25:05.6158966Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6159261Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:05.6159532Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:05.6159808Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:05.6160061Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:05.6160349Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:05.6160655Z #define __INT64_C(c) c ## L 2025-05-07T20:25:05.6160921Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:05.6161227Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:05.6161487Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:05.6161851Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:05.6162247Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:05.6162498Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:05.6162771Z #define __DBL_DIG__ 15 2025-05-07T20:25:05.6163002Z #define __FLT32_DIG__ 6 2025-05-07T20:25:05.6163307Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:05.6163675Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:05.6164026Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:05.6164356Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:05.6164724Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:05.6164982Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.6165245Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:05.6165635Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:05.6166064Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:05.6166354Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:05.6166614Z #define __unix__ 1 2025-05-07T20:25:05.6166843Z #define __INT_WIDTH__ 32 2025-05-07T20:25:05.6167096Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:05.6167346Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:05.6167684Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:05.6167952Z #define __UINT16_C(c) c 2025-05-07T20:25:05.6168181Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:05.6168436Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:05.6168809Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:05.6169183Z #define __gnu_linux__ 1 2025-05-07T20:25:05.6169427Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:05.6169707Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6169999Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6170263Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:05.6170528Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:05.6170780Z #define __GNUC__ 11 2025-05-07T20:25:05.6170988Z #define __pie__ 2 2025-05-07T20:25:05.6171201Z #define __MMX__ 1 2025-05-07T20:25:05.6171420Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:05.6171683Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:05.6171964Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:05.6172248Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:05.6172600Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.6173018Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6173350Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.6173612Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:05.6173875Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:05.6174182Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:05.6174449Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:05.6174712Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:05.6174996Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:05.6175297Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:05.6175599Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:05.6175912Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:05.6176167Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:05.6176439Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:05.6176708Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:05.6185567Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:05.6185839Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:05.6186176Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.6186565Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:05.6186847Z #define __SSE2_MATH__ 1 2025-05-07T20:25:05.6187091Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:05.6187399Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6187706Z #define __amd64 1 2025-05-07T20:25:05.6187930Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:05.6188202Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:05.6188514Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:05.6188832Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:05.6189096Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6189379Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:05.6189631Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:05.6190034Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:05.6190307Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:05.6190564Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:05.6190834Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:05.6191121Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:05.6191545Z #define __x86_64 1 2025-05-07T20:25:05.6191764Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:05.6192138Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:05.6192613Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:05.6193081Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:05.6193580Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.6193984Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:05.6194235Z #define __LP64__ 1 2025-05-07T20:25:05.6194465Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6194827Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:05.6195344Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:05.6195617Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6195896Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6196190Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:05.6196476Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:05.6196749Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:05.6197011Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:05.6197265Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:05.6197529Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.6197868Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:05.6198233Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:05.6198517Z #define __FLT_DIG__ 6 2025-05-07T20:25:05.6198747Z #define __NO_INLINE__ 1 2025-05-07T20:25:05.6198979Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:05.6199314Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:05.6199675Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:05.6199939Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:05.6200197Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:05.6200453Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:05.6200712Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:05.6200967Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:05.6201269Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:05.6201558Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:05.6201821Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:05.6202128Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.6202471Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:05.6202732Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:05.6202996Z #define __FLT128_DIG__ 33 2025-05-07T20:25:05.6203241Z #define __INT32_C(c) c 2025-05-07T20:25:05.6203477Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:05.6203766Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:05.6204049Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:05.6204341Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:05.6204661Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:05.6204979Z #define unix 1 2025-05-07T20:25:05.6205207Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:05.6205526Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6205838Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:05.6206157Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:05.6206487Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:05.6206739Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:05.6207005Z #define __ELF__ 1 2025-05-07T20:25:05.6207227Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:05.6207514Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:05.6207795Z #define __FLT_RADIX__ 2 2025-05-07T20:25:05.6208037Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:05.6208410Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:05.6208789Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:05.6209053Z #define __SSE_MATH__ 1 2025-05-07T20:25:05.6209271Z #define __k8 1 2025-05-07T20:25:05.6209572Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:05.6209960Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:05.6210345Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:05.6210655Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:05.6210920Z #define __LDBL_DIG__ 18 2025-05-07T20:25:05.6211152Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:05.6211409Z #define __x86_64__ 1 2025-05-07T20:25:05.6211647Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:05.6211944Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:05.6212284Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6212595Z #define __FLT64_DIG__ 15 2025-05-07T20:25:05.6212875Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6213227Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.6213546Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6213923Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:05.6214194Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6214495Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:05.6214879Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:05.6215283Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:05.6215576Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:05.6215918Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:05.6216278Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:05.6216588Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:05.6216866Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:05.6217179Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:05.6217461Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:05.6217696Z #define __SEG_FS 1 2025-05-07T20:25:05.6217929Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:05.6218210Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:05.6218490Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6218782Z #define __SEG_GS 1 2025-05-07T20:25:05.6219103Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:05.6219506Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:05.6219774Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:05.6220070Z #define __INT16_TYPE__ short int 2025-05-07T20:25:05.6220351Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:05.6220647Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:05.6220912Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:05.6221158Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:05.6221414Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:05.6221764Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.6222166Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6222459Z #define linux 1 2025-05-07T20:25:05.6222691Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6222980Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.6223265Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:05.6223515Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:05.6223784Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:05.6224055Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:05.6224419Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.6224855Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:05.6225199Z #define __code_model_small__ 1 2025-05-07T20:25:05.6225473Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:05.6225771Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:05.6226025Z #define __k8__ 1 2025-05-07T20:25:05.6226251Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:05.6226545Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:05.6226856Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:05.6227098Z #define __pic__ 2 2025-05-07T20:25:05.6227351Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6227673Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:05.6227984Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6228320Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:05.6228708Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.6229181Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:05.6229449Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:05.6229807Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.6230128Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:05.6230376Z #define __linux__ 1 2025-05-07T20:25:05.6230607Z #define __INT64_TYPE__ long int 2025-05-07T20:25:05.6230872Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:05.6231125Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:05.6231404Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:05.6231666Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:05.6231962Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6232291Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:05.6232677Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:05.6232951Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:05.6233241Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:05.6233543Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:05.6233892Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.6234258Z #define __SSE__ 1 2025-05-07T20:25:05.6234482Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:05.6234826Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.6235177Z #define __amd64__ 1 2025-05-07T20:25:05.6235405Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:05.6235658Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:05.6235922Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:05.6236200Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:05.6236470Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:05.6236745Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:05.6237001Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:05.6237282Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:05.6237555Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:05.6237905Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:05.6238401Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:05.6238774Z #define _LP64 1 2025-05-07T20:25:05.6238981Z #define __UINT8_C(c) c 2025-05-07T20:25:05.6239219Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:05.6239485Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:05.6239750Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:05.6240022Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:05.6240323Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:05.6240690Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.6241168Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.6241551Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6241853Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.6242161Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:05.6242538Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:05.6242920Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:05.6243188Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:05.6243525Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:05.6243903Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:05.6244162Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:05.6244405Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:05.6244661Z #define __FXSR__ 1 2025-05-07T20:25:05.6244965Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.6245436Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.6245858Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.6246169Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:05.6246419Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:05.6246759Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:05.6247127Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:05.6247370Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:05.6247692Z #define __PIC__ 2 2025-05-07T20:25:05.6247943Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:05.6248356Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.6248748Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:05.6249088Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.6249427Z #define __SSE2__ 1 2025-05-07T20:25:05.6249640Z #define __INT32_TYPE__ int 2025-05-07T20:25:05.6249897Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:05.6250161Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.6250495Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:05.6250866Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:05.6251219Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:05.6251495Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:05.6251761Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6252040Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:05.6252297Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:05.6252540Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:05.6252830Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6253131Z #define __PIE__ 2 2025-05-07T20:25:05.6253453Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:05.6253863Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:05.6254217Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:05.6254589Z #define __INT16_C(c) c 2025-05-07T20:25:05.6254823Z #define __STDC__ 1 2025-05-07T20:25:05.6255062Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:05.6255332Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:05.6255594Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.6255908Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:05.6256266Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:05.6256607Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:05.6256884Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.6257174Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:05.6257439Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:05.6257728Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:05.6258030Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.6258307Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:05.6258611Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.6259023Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.6259406Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:05.6259716Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:05.6260021Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:05.6260273Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:05.6260439Z 2025-05-07T20:25:05.6781252Z 2025-05-07T20:25:05.6781605Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
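The second dump repeats the exercise for C++: -x c++ forces the driver to preprocess stdin as C++, so the language-mode macros (e.g. __cplusplus) and the C++ feature-test macros (e.g. __cpp_if_constexpr) appear alongside the target defines. A sketch for checking which standard this toolchain compiles under by default:

    conda run -n build_binary c++ -dM -E -x c++ - </dev/null | grep -w __cplusplus
    # expected from the dump below: #define __cplusplus 201703L  (C++17)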
2025-05-07T20:25:05.6782253Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:05.6782585Z 2025-05-07T20:25:07.5592642Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.5593141Z #define __cpp_attributes 200809L 2025-05-07T20:25:07.5593598Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:07.5594019Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.5594314Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.5594584Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.5594929Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.5595288Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.5595575Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:07.5595944Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.5596417Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.5596809Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.5597152Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.5597458Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.5597712Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.5597966Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.5600066Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.5600367Z #define __cpp_static_assert 201411L 2025-05-07T20:25:07.5600656Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.5600965Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5601274Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.5601568Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.5601894Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.5602226Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.5602645Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.5603068Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.5603547Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.5603837Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.5604080Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.5604359Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.5604647Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:07.5604937Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.5605241Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:07.5605568Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.5605885Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:07.5606216Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5606546Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.5606820Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.5607096Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.5607375Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:07.5607677Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.5607940Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.5608212Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.5608489Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:07.5608821Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:07.5609155Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.5609420Z #define __INT8_C(c) c 2025-05-07T20:25:07.5609663Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.5609929Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:07.5610255Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.5610587Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.5610860Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:07.5611163Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.5611493Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.5611854Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.5612143Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:07.5612427Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.5612688Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.5612975Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.5613254Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.5613655Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.5614092Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.5614386Z #define __linux 1 2025-05-07T20:25:07.5614617Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.5614892Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:07.5615179Z #define __unix 1 2025-05-07T20:25:07.5615407Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.5615689Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:07.5615984Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.5616265Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.5616506Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.5616793Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.5617073Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.5617346Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.5617602Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.5617891Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.5618185Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.5618558Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.5618874Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.5619159Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.5619471Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:07.5619762Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.5620034Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.5620399Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.5620806Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.5621067Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.5621346Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:07.5621631Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.5621868Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.5622256Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.5622616Z #define __GXX_WEAK__ 1 2025-05-07T20:25:07.5622851Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.5623105Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.5623441Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.5623806Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.5624075Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.5624376Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:07.5624713Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:07.5625129Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.5625548Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.5625839Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.5626107Z #define __unix__ 1 2025-05-07T20:25:07.5626339Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.5626591Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.5626844Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.5627109Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:07.5627385Z #define __UINT16_C(c) c 2025-05-07T20:25:07.5627621Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.5627906Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.5628287Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.5628667Z #define __gnu_linux__ 1 2025-05-07T20:25:07.5628915Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.5629187Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.5629473Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.5629982Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.5630292Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.5630557Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.5630825Z #define __GNUC__ 11 2025-05-07T20:25:07.5640541Z #define __GXX_RTTI 1 2025-05-07T20:25:07.5640829Z #define __pie__ 2 2025-05-07T20:25:07.5641047Z #define __MMX__ 1 2025-05-07T20:25:07.5641280Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.5641578Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.5641869Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.5642153Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.5642419Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.5642739Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:07.5643071Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.5643443Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.5643845Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:07.5644161Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.5644498Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.5644777Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.5645053Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.5645380Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:07.5645693Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.5645965Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.5646248Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.5646548Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.5646862Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.5647142Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.5647630Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.5647895Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.5648158Z #define __cplusplus 201703L 2025-05-07T20:25:07.5648433Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:07.5648728Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.5648989Z #define __DEPRECATED 1 2025-05-07T20:25:07.5649258Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:07.5649567Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.5649826Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.5650163Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.5650540Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.5650817Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.5651082Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.5651488Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.5651787Z #define __amd64 1 2025-05-07T20:25:07.5652009Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.5652282Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.5652562Z #define __GNUG__ 11 2025-05-07T20:25:07.5652818Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.5653137Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.5653396Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:07.5653656Z #define __FLT64X_MIN_EXP__ (-16381) 
[... full compiler predefined-macro dump omitted: GCC 11.4.0 (__VERSION__ "11.4.0", conda-forge), x86_64 Linux/ELF, little-endian, LP64, with C++17-level feature-test macros (__cpp_deduction_guides 201703L, __cpp_structured_bindings 201606L, ...) ...]
2025-05-07T20:25:07.6245925Z + conda run -n build_binary c++ --version
2025-05-07T20:25:09.5189446Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:09.5190202Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:09.5190906Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:09.5191711Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:09.5821884Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:09.5822490Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:11.5343311Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:11.5347169Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:11.5347755Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:13.4883019Z #define __cplusplus 201703L
2025-05-07T20:25:13.4886070Z [INSTALL] Successfully installed C/C++ compilers
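[NOTE] The two probes above work by preprocessing an empty translation unit and grepping the compiler's predefined macros; 201710L and 201703L correspond to C17 and C++17 respectively. A minimal standalone sketch of the same check (assumes only a conda env named build_binary with the compilers installed, as in this job):

    # Default C standard reported by the C compiler (201710L == C17)
    conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
    # Default C++ standard; -x c++ forces C++ mode for stdin (201703L == C++17)
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
    # Same probe with an explicit -std flag, to confirm an override takes effect
    conda run -n build_binary c++ -std=c++20 -dM -E -x c++ - < /dev/null | grep __cplusplus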
2025-05-07T20:25:13.4931226Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:13.4931647Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:13.4944617Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:13.4944974Z env:
2025-05-07T20:25:13.4945198Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:13.4945495Z   BUILD_ENV: build_binary
2025-05-07T20:25:13.4945740Z   BUILD_TARGET: genai
2025-05-07T20:25:13.4945970Z   BUILD_VARIANT: cuda
2025-05-07T20:25:13.4946199Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:13.4946455Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:13.4946758Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:13.4947090Z ##[endgroup]
2025-05-07T20:25:13.8298402Z ################################################################################
2025-05-07T20:25:13.8298775Z # Install CUDA
2025-05-07T20:25:13.8298982Z #
2025-05-07T20:25:13.8315157Z # [2025-05-07T20:25:13.831Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:13.8315879Z ################################################################################
2025-05-07T20:25:13.8330920Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:13.9240524Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:13.9240879Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:13.9245601Z + conda clean --packages --tarball -y
2025-05-07T20:25:14.6336363Z Will remove 32 (140.4 MB) tarball(s).
2025-05-07T20:25:14.6336707Z Will remove 6 (617 KB) package(s).
2025-05-07T20:25:14.7142138Z + conda clean --all -y
2025-05-07T20:25:15.3825238Z There are no unused tarball(s) to remove.
2025-05-07T20:25:15.3825919Z Will remove 1 index cache(s).
2025-05-07T20:25:15.3826489Z There are no unused package(s) to remove.
2025-05-07T20:25:15.3827115Z There are no tempfile(s) to remove.
2025-05-07T20:25:15.3827725Z There are no logfile(s) to remove.
2025-05-07T20:25:15.4478141Z [INSTALL] Installing CUDA 12.8.0 ...
2025-05-07T20:25:15.4502292Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0
2025-05-07T20:25:16.3554719Z Channels:
2025-05-07T20:25:16.3555301Z  - conda-forge
2025-05-07T20:25:26.8573259Z Platform: linux-64
2025-05-07T20:25:26.8574912Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:27.9781591Z Solving environment: done
2025-05-07T20:25:28.0516806Z ## Package Plan ##
2025-05-07T20:25:28.0517339Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:28.0517779Z   added / updated specs:
2025-05-07T20:25:28.0518047Z     - cuda=12.8.0
2025-05-07T20:25:28.0518366Z The following packages will be downloaded:
2025-05-07T20:25:28.0518812Z     package                    |            build
2025-05-07T20:25:28.0519277Z     ---------------------------|-----------------
2025-05-07T20:25:28.0519746Z     alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge
2025-05-07T20:25:28.0520410Z     attr-2.5.1 | h166bdaf_1 69 KB conda-forge
2025-05-07T20:25:28.0520987Z     binutils-2.40 | h4852527_7 31 KB conda-forge
2025-05-07T20:25:28.0521507Z     bzip2-1.0.8 | h4bc722e_7 247 KB conda-forge
2025-05-07T20:25:28.0521942Z     c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge
2025-05-07T20:25:28.0522395Z     cuda-12.8.0 | ha804496_0 26 KB conda-forge
2025-05-07T20:25:28.0522969Z     cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge
2025-05-07T20:25:28.0524189Z     cuda-command-line-tools-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:28.0524741Z     cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge
2025-05-07T20:25:28.0525249Z     cuda-crt-dev_linux-64-12.8.61 | ha770c72_1 90 KB conda-forge
2025-05-07T20:25:28.0525752Z     cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge
2025-05-07T20:25:28.0526223Z     cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:28.0526720Z     cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge
2025-05-07T20:25:28.0527246Z     cuda-cudart-dev_linux-64-12.8.57 | h3f2d84a_1 377 KB conda-forge
2025-05-07T20:25:28.0527784Z     cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:28.0528335Z     cuda-cudart-static_linux-64-12.8.57 | h3f2d84a_1 950 KB conda-forge
2025-05-07T20:25:28.0529059Z     cuda-cudart_linux-64-12.8.57 | h3f2d84a_1 188 KB conda-forge
2025-05-07T20:25:28.0529573Z     cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge
2025-05-07T20:25:28.0530050Z     cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge
2025-05-07T20:25:28.0530522Z     cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge
2025-05-07T20:25:28.0531019Z     cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge
2025-05-07T20:25:28.0531559Z     cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:28.0532082Z     cuda-driver-dev_linux-64-12.8.90 | h3f2d84a_1 36 KB conda-forge
2025-05-07T20:25:28.0532584Z     cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge
2025-05-07T20:25:28.0533060Z     cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:28.0533574Z     cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:28.0534080Z     cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge
2025-05-07T20:25:28.0534547Z     cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge
2025-05-07T20:25:28.0535043Z     cuda-nvcc-dev_linux-64-12.8.61 | he91c749_1 12.7 MB conda-forge
2025-05-07T20:25:28.0535550Z     cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge
2025-05-07T20:25:28.0536047Z     cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 MB conda-forge
2025-05-07T20:25:28.0536551Z     cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge
2025-05-07T20:25:28.0537043Z     cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge
2025-05-07T20:25:28.0537529Z     cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge
2025-05-07T20:25:28.0538008Z     cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge
2025-05-07T20:25:28.0538504Z     cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge
2025-05-07T20:25:28.0538978Z     cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge
2025-05-07T20:25:28.0539455Z     cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge
2025-05-07T20:25:28.0539926Z     cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge
2025-05-07T20:25:28.0540420Z     cuda-nvvm-dev_linux-64-12.8.61 | ha770c72_1 25 KB conda-forge
2025-05-07T20:25:28.0540925Z     cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge
2025-05-07T20:25:28.0541422Z     cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge
2025-05-07T20:25:28.0541904Z     cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge
2025-05-07T20:25:28.0542365Z     cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge
2025-05-07T20:25:28.0542858Z     cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge
2025-05-07T20:25:28.0543502Z     cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge
2025-05-07T20:25:28.0544003Z     cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge
2025-05-07T20:25:28.0544501Z     cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge
2025-05-07T20:25:28.0545002Z     cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge
2025-05-07T20:25:28.0545465Z     cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge
2025-05-07T20:25:28.0545920Z     cuda-version-12.8 | h5d125a7_3 21 KB conda-forge
2025-05-07T20:25:28.0546416Z     cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:28.0546917Z     cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge
2025-05-07T20:25:28.0547361Z     dbus-1.13.6 | h5008d03_3 604 KB conda-forge
2025-05-07T20:25:28.0547774Z     expat-2.7.0 | h5888daf_0 137 KB conda-forge
2025-05-07T20:25:28.0548355Z     font-ttf-dejavu-sans-mono-2.37 | hab24e00_0 388 KB conda-forge
2025-05-07T20:25:28.0548911Z     font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge
2025-05-07T20:25:28.0549455Z     font-ttf-source-code-pro-2.038 | h77eed37_0 684 KB conda-forge
2025-05-07T20:25:28.0550139Z     font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge
2025-05-07T20:25:28.0550627Z     fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge
2025-05-07T20:25:28.0551132Z     fonts-conda-ecosystem-1 | 0 4 KB conda-forge
2025-05-07T20:25:28.0551641Z     fonts-conda-forge-1 | 0 4 KB conda-forge
2025-05-07T20:25:28.0552119Z     freetype-2.13.3 | ha770c72_1 168 KB conda-forge
2025-05-07T20:25:28.0552550Z     gcc-11.4.0 | h602e360_13 49 KB conda-forge
2025-05-07T20:25:28.0552997Z     gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge
2025-05-07T20:25:28.0553423Z     gmp-6.3.0 | hac33072_2 449 KB conda-forge
2025-05-07T20:25:28.0553827Z     gxx-11.4.0 | h602e360_13 49 KB conda-forge
2025-05-07T20:25:28.0554259Z     keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge
2025-05-07T20:25:28.0554685Z     krb5-1.21.3 | h659f571_0 1.3 MB conda-forge
2025-05-07T20:25:28.0555107Z     libcap-2.71 | h39aace5_0 100 KB conda-forge
2025-05-07T20:25:28.0555563Z     libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge
2025-05-07T20:25:28.0556053Z     libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge
2025-05-07T20:25:28.0556530Z     libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge
2025-05-07T20:25:28.0557010Z     libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge
2025-05-07T20:25:28.0557500Z     libcufile-1.13.0.11 | h12f29b5_0 939 KB conda-forge
2025-05-07T20:25:28.0557978Z     libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge
2025-05-07T20:25:28.0558458Z     libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge
2025-05-07T20:25:28.0558935Z     libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge
2025-05-07T20:25:28.0559450Z     libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge
2025-05-07T20:25:28.0559956Z     libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge
2025-05-07T20:25:28.0560464Z     libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge
2025-05-07T20:25:28.0560962Z     libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge
2025-05-07T20:25:28.0561470Z     libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge
2025-05-07T20:25:28.0562048Z     libexpat-2.7.0 | h5888daf_0 73 KB conda-forge
2025-05-07T20:25:28.0562506Z     libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge
2025-05-07T20:25:28.0562997Z     libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge
2025-05-07T20:25:28.0563489Z     libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge
2025-05-07T20:25:28.0563960Z     libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge
2025-05-07T20:25:28.0564401Z     libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge
2025-05-07T20:25:28.0564868Z     libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge
2025-05-07T20:25:28.0565334Z     libiconv-1.18 | h4ce23a2_1 696 KB conda-forge
2025-05-07T20:25:28.0565765Z     libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge
2025-05-07T20:25:28.0566211Z     libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge
2025-05-07T20:25:28.0566755Z     libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge
2025-05-07T20:25:28.0567210Z     libnsl-2.0.1 | hd590300_0 33 KB conda-forge
2025-05-07T20:25:28.0567641Z     libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge
2025-05-07T20:25:28.0568105Z     libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge
2025-05-07T20:25:28.0568598Z     libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge
2025-05-07T20:25:28.0569094Z     libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge
2025-05-07T20:25:28.0569600Z     libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge
2025-05-07T20:25:28.0570095Z     libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge
2025-05-07T20:25:28.0570577Z     libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge
2025-05-07T20:25:28.0571066Z     libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge
2025-05-07T20:25:28.0571513Z     libpng-1.6.47 | h943b412_0 282 KB conda-forge
2025-05-07T20:25:28.0571958Z     libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge
2025-05-07T20:25:28.0572427Z     libsystemd0-256.9 | h2774228_0 401 KB conda-forge
2025-05-07T20:25:28.0572898Z     libudev1-257.4 | h9a4d06a_0 140 KB conda-forge
2025-05-07T20:25:28.0573347Z     libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge
2025-05-07T20:25:28.0573783Z     libxcb-1.17.0 | h8a09558_0 387 KB conda-forge
2025-05-07T20:25:28.0574237Z     libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge
2025-05-07T20:25:28.0574717Z     libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge
2025-05-07T20:25:28.0575174Z     libxml2-2.13.5 | h064dc61_0 673 KB conda-forge
2025-05-07T20:25:28.0575620Z     libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge
2025-05-07T20:25:28.0576042Z     lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge
2025-05-07T20:25:28.0576513Z     nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge
2025-05-07T20:25:28.0576986Z     nspr-4.36 | h5888daf_0 225 KB conda-forge
2025-05-07T20:25:28.0577396Z     nss-3.111 | h159eef7_0 1.9 MB conda-forge
2025-05-07T20:25:28.0577822Z     ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge
2025-05-07T20:25:28.0578304Z     opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge
2025-05-07T20:25:28.0578775Z     pcre2-10.44 | hc749103_2 934 KB conda-forge
2025-05-07T20:25:28.0579231Z     pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge
2025-05-07T20:25:28.0579817Z     python-3.9.18 | h0755675_1_cpython 22.7 MB conda-forge
2025-05-07T20:25:28.0580281Z     rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge
2025-05-07T20:25:28.0580717Z     sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge
2025-05-07T20:25:28.0581148Z     tk-8.6.13 | noxft_h4845f30_101 3.2 MB conda-forge
2025-05-07T20:25:28.0581580Z     wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge
2025-05-07T20:25:28.0582022Z     xcb-util-0.4.1 | hb711507_2 19 KB conda-forge
2025-05-07T20:25:28.0582488Z     xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge
2025-05-07T20:25:28.0583301Z     xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge
2025-05-07T20:25:28.0583798Z     xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge
2025-05-07T20:25:28.0584319Z     xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge
2025-05-07T20:25:28.0584977Z     xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge
2025-05-07T20:25:28.0585468Z     xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge
2025-05-07T20:25:28.0585959Z     xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge
2025-05-07T20:25:28.0586415Z     xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge
2025-05-07T20:25:28.0586879Z     xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge
2025-05-07T20:25:28.0587351Z     xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge
2025-05-07T20:25:28.0587859Z     xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge
2025-05-07T20:25:28.0588373Z     xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge
2025-05-07T20:25:28.0588871Z     xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge
2025-05-07T20:25:28.0589369Z     xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge
2025-05-07T20:25:28.0589980Z     xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge
2025-05-07T20:25:28.0590458Z     xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge
2025-05-07T20:25:28.0590961Z     xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge
2025-05-07T20:25:28.0591487Z     xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge
2025-05-07T20:25:28.0591971Z     xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge
2025-05-07T20:25:28.0592415Z     zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge
2025-05-07T20:25:28.0592830Z     zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge
2025-05-07T20:25:28.0593232Z     ------------------------------------------------------------
2025-05-07T20:25:28.0593604Z     Total: 1.90 GB
2025-05-07T20:25:28.0593981Z The following NEW packages will be INSTALLED:
[... per-package install listing omitted: identical to the download table above (apart from the updated/superseded entries below), all pulled from conda-forge linux-64 or noarch ...]
2025-05-07T20:25:28.0681305Z The following packages will be UPDATED:
2025-05-07T20:25:28.0681775Z   zlib     pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:28.0682376Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:28.0683362Z   python   pkgs/main::python-3.9.21-he870216_1 --> conda-forge::python-3.9.18-h0755675_1_cpython
2025-05-07T20:25:28.0684046Z   sqlite   pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:28.0684667Z   tk       pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:25:28.0685169Z Downloading and Extracting Packages: ...working...
[... interleaved per-package download progress bars omitted (largest: libcublas 460.2 MB, nsight-compute 320.6 MB, libcusparse 164.9 MB, libcusolver 156.9 MB, libcufft 147.4 MB, libnpp 130.6 MB, cuda-nsight 113.2 MB, cuda-nvvp 112.4 MB) ...]
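[NOTE] The [EXEC] [ATTEMPT 0/3] prefix above comes from the setup script's retry helper. A minimal bash sketch of that pattern wrapped around the same pinned install (the exec_with_retries name and the 10-second backoff are illustrative, not the script's actual implementation):

    # Run a command, retrying up to 3 times on failure (sketch of the [EXEC] [ATTEMPT n/3] pattern)
    exec_with_retries () {
      local max=3 i
      for ((i = 0; i < max; i++)); do
        echo "[EXEC] [ATTEMPT ${i}/${max}] + $*"
        "$@" && return 0
        sleep 10  # illustrative backoff between attempts
      done
      echo "[EXEC] Command failed after ${max} attempts: $*" >&2
      return 1
    }

    # Pin the full CUDA 12.8.0 toolkit from conda-forge only: --override-channels ignores
    # any other configured channels, and --force-reinstall keeps the step idempotent.
    exec_with_retries conda install --force-reinstall -n build_binary \
        -c conda-forge --override-channels -y cuda=12.8.0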
| 320.6 MB | ##8 | 29%  2025-05-07T20:25:30.8697756Z 2025-05-07T20:25:30.8700247Z 2025-05-07T20:25:30.8843686Z libcusparse-12.5.7.5 | 164.9 MB | #####8 | 58%  2025-05-07T20:25:30.9052126Z libcublas-12.8.3.14 | 460.2 MB | #9 | 20% 2025-05-07T20:25:30.9052395Z 2025-05-07T20:25:30.9052400Z 2025-05-07T20:25:30.9057624Z 2025-05-07T20:25:30.9106124Z libcusolver-11.7.2.5 | 156.9 MB | ######2 | 63%  2025-05-07T20:25:30.9106513Z 2025-05-07T20:25:30.9106518Z 2025-05-07T20:25:30.9106522Z 2025-05-07T20:25:30.9107151Z 2025-05-07T20:25:30.9300645Z libcufft-11.3.3.41 | 147.4 MB | #####9 | 59%  2025-05-07T20:25:30.9301650Z 2025-05-07T20:25:30.9698474Z nsight-compute-2025. | 320.6 MB | ##9 | 30%  2025-05-07T20:25:30.9698765Z 2025-05-07T20:25:30.9698770Z 2025-05-07T20:25:30.9845447Z libcusparse-12.5.7.5 | 164.9 MB | ###### | 61%  2025-05-07T20:25:31.0052578Z libcublas-12.8.3.14 | 460.2 MB | ## | 21% 2025-05-07T20:25:31.0053269Z 2025-05-07T20:25:31.0053273Z 2025-05-07T20:25:31.0053858Z 2025-05-07T20:25:31.0109820Z libcusolver-11.7.2.5 | 156.9 MB | ######5 | 65%  2025-05-07T20:25:31.0110145Z 2025-05-07T20:25:31.0110151Z 2025-05-07T20:25:31.0110156Z 2025-05-07T20:25:31.0110161Z 2025-05-07T20:25:31.0304395Z libcufft-11.3.3.41 | 147.4 MB | ######1 | 62%  2025-05-07T20:25:31.0304696Z 2025-05-07T20:25:31.0701953Z nsight-compute-2025. | 320.6 MB | ###1 | 31%  2025-05-07T20:25:31.0702248Z 2025-05-07T20:25:31.0702873Z 2025-05-07T20:25:31.0851796Z libcusparse-12.5.7.5 | 164.9 MB | ######2 | 63%  2025-05-07T20:25:31.1054562Z libcublas-12.8.3.14 | 460.2 MB | ##1 | 21% 2025-05-07T20:25:31.1054876Z 2025-05-07T20:25:31.1054881Z 2025-05-07T20:25:31.1056363Z 2025-05-07T20:25:31.1126643Z libcusolver-11.7.2.5 | 156.9 MB | ######7 | 67%  2025-05-07T20:25:31.1127025Z 2025-05-07T20:25:31.1127031Z 2025-05-07T20:25:31.1127037Z 2025-05-07T20:25:31.1127042Z 2025-05-07T20:25:31.1495314Z libcufft-11.3.3.41 | 147.4 MB | ######4 | 64%  2025-05-07T20:25:31.1495684Z 2025-05-07T20:25:31.1727012Z nsight-compute-2025. | 320.6 MB | ###2 | 32%  2025-05-07T20:25:31.1727389Z 2025-05-07T20:25:31.1727394Z 2025-05-07T20:25:31.2007640Z libcusparse-12.5.7.5 | 164.9 MB | ######4 | 65%  2025-05-07T20:25:31.2084717Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 22% 2025-05-07T20:25:31.2084996Z 2025-05-07T20:25:31.2085000Z 2025-05-07T20:25:31.2085004Z 2025-05-07T20:25:31.2248752Z libcusolver-11.7.2.5 | 156.9 MB | ######9 | 70%  2025-05-07T20:25:31.2249085Z 2025-05-07T20:25:31.2249349Z 2025-05-07T20:25:31.2249354Z 2025-05-07T20:25:31.2254543Z 2025-05-07T20:25:31.2702885Z libcufft-11.3.3.41 | 147.4 MB | ######6 | 66%  2025-05-07T20:25:31.2704008Z 2025-05-07T20:25:31.2730751Z nsight-compute-2025. | 320.6 MB | ###3 | 33%  2025-05-07T20:25:31.2731068Z 2025-05-07T20:25:31.2731072Z 2025-05-07T20:25:31.3011232Z libcusparse-12.5.7.5 | 164.9 MB | ######7 | 67%  2025-05-07T20:25:31.3136636Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 23% 2025-05-07T20:25:31.3136930Z 2025-05-07T20:25:31.3136935Z 2025-05-07T20:25:31.3138226Z 2025-05-07T20:25:31.3369490Z libcusolver-11.7.2.5 | 156.9 MB | #######1 | 72%  2025-05-07T20:25:31.3369794Z 2025-05-07T20:25:31.3369799Z 2025-05-07T20:25:31.3369803Z 2025-05-07T20:25:31.3372201Z 2025-05-07T20:25:31.3732268Z libcufft-11.3.3.41 | 147.4 MB | ######8 | 69%  2025-05-07T20:25:31.3732563Z 2025-05-07T20:25:31.3732574Z 2025-05-07T20:25:31.3770241Z libcusparse-12.5.7.5 | 164.9 MB | ######9 | 69%  2025-05-07T20:25:31.3771905Z 2025-05-07T20:25:31.4014417Z nsight-compute-2025. 
| 320.6 MB | ###4 | 34%  2025-05-07T20:25:31.4138289Z libcublas-12.8.3.14 | 460.2 MB | ##3 | 24% 2025-05-07T20:25:31.4138557Z 2025-05-07T20:25:31.4138561Z 2025-05-07T20:25:31.4139314Z 2025-05-07T20:25:31.4733911Z libcusolver-11.7.2.5 | 156.9 MB | #######4 | 74%  2025-05-07T20:25:31.4734218Z 2025-05-07T20:25:31.4734228Z 2025-05-07T20:25:31.4772961Z libcusparse-12.5.7.5 | 164.9 MB | #######1 | 72%  2025-05-07T20:25:31.4775206Z 2025-05-07T20:25:31.5016522Z nsight-compute-2025. | 320.6 MB | ###5 | 36%  2025-05-07T20:25:31.5108520Z libcublas-12.8.3.14 | 460.2 MB | ##4 | 25% 2025-05-07T20:25:31.5108780Z 2025-05-07T20:25:31.5108785Z 2025-05-07T20:25:31.5108788Z 2025-05-07T20:25:31.5108793Z 2025-05-07T20:25:31.5140817Z libcufft-11.3.3.41 | 147.4 MB | ####### | 71%  2025-05-07T20:25:31.5141151Z 2025-05-07T20:25:31.5141155Z 2025-05-07T20:25:31.5141887Z 2025-05-07T20:25:31.5836703Z libcusolver-11.7.2.5 | 156.9 MB | #######6 | 77%  2025-05-07T20:25:31.5837022Z 2025-05-07T20:25:31.5837026Z 2025-05-07T20:25:31.5871258Z libcusparse-12.5.7.5 | 164.9 MB | #######4 | 74%  2025-05-07T20:25:31.5872195Z 2025-05-07T20:25:31.6076509Z nsight-compute-2025. | 320.6 MB | ###6 | 37%  2025-05-07T20:25:31.6114970Z libcublas-12.8.3.14 | 460.2 MB | ##5 | 25% 2025-05-07T20:25:31.6115267Z 2025-05-07T20:25:31.6115271Z 2025-05-07T20:25:31.6115275Z 2025-05-07T20:25:31.6116387Z 2025-05-07T20:25:31.6212007Z libcufft-11.3.3.41 | 147.4 MB | #######2 | 73%  2025-05-07T20:25:31.6212304Z 2025-05-07T20:25:31.6212308Z 2025-05-07T20:25:31.6212312Z 2025-05-07T20:25:31.6902744Z libcusolver-11.7.2.5 | 156.9 MB | #######8 | 79%  2025-05-07T20:25:31.6903048Z 2025-05-07T20:25:31.6903053Z 2025-05-07T20:25:31.6954090Z libcusparse-12.5.7.5 | 164.9 MB | #######6 | 76%  2025-05-07T20:25:31.6954875Z 2025-05-07T20:25:31.7077836Z nsight-compute-2025. | 320.6 MB | ###7 | 38%  2025-05-07T20:25:31.7212421Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 26% 2025-05-07T20:25:31.7212759Z 2025-05-07T20:25:31.7212764Z 2025-05-07T20:25:31.7212768Z 2025-05-07T20:25:31.7694387Z libcusolver-11.7.2.5 | 156.9 MB | ########1 | 81%  2025-05-07T20:25:31.7694715Z 2025-05-07T20:25:31.7694721Z 2025-05-07T20:25:31.7694742Z 2025-05-07T20:25:31.7698471Z 2025-05-07T20:25:31.7904984Z libcufft-11.3.3.41 | 147.4 MB | #######4 | 75%  2025-05-07T20:25:31.7905459Z 2025-05-07T20:25:31.7905465Z 2025-05-07T20:25:31.7955816Z libcusparse-12.5.7.5 | 164.9 MB | #######8 | 79%  2025-05-07T20:25:31.7958764Z 2025-05-07T20:25:31.8079688Z nsight-compute-2025. | 320.6 MB | ###8 | 39%  2025-05-07T20:25:31.8217627Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 27% 2025-05-07T20:25:31.8217937Z 2025-05-07T20:25:31.8218242Z 2025-05-07T20:25:31.8218249Z 2025-05-07T20:25:31.8697572Z libcusolver-11.7.2.5 | 156.9 MB | ########3 | 83%  2025-05-07T20:25:31.8697996Z 2025-05-07T20:25:31.8698003Z 2025-05-07T20:25:31.8698009Z 2025-05-07T20:25:31.8698014Z 2025-05-07T20:25:31.8969609Z libcufft-11.3.3.41 | 147.4 MB | #######6 | 77%  2025-05-07T20:25:31.8969997Z 2025-05-07T20:25:31.8970002Z 2025-05-07T20:25:31.9071456Z libcusparse-12.5.7.5 | 164.9 MB | ########1 | 81%  2025-05-07T20:25:31.9072580Z 2025-05-07T20:25:31.9221142Z nsight-compute-2025. 
| 320.6 MB | ###9 | 40%  2025-05-07T20:25:31.9563545Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 28% 2025-05-07T20:25:31.9563821Z 2025-05-07T20:25:31.9563825Z 2025-05-07T20:25:31.9563829Z 2025-05-07T20:25:31.9697579Z libcusolver-11.7.2.5 | 156.9 MB | ########5 | 86%  2025-05-07T20:25:31.9697881Z 2025-05-07T20:25:31.9697885Z 2025-05-07T20:25:31.9697889Z 2025-05-07T20:25:31.9697893Z 2025-05-07T20:25:32.0024067Z libcufft-11.3.3.41 | 147.4 MB | #######8 | 79%  2025-05-07T20:25:32.0024391Z 2025-05-07T20:25:32.0028718Z 2025-05-07T20:25:32.0186032Z libcusparse-12.5.7.5 | 164.9 MB | ########3 | 83%  2025-05-07T20:25:32.0186404Z 2025-05-07T20:25:32.0261636Z nsight-compute-2025. | 320.6 MB | ####1 | 41%  2025-05-07T20:25:32.0699605Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 29% 2025-05-07T20:25:32.0699930Z 2025-05-07T20:25:32.0699942Z 2025-05-07T20:25:32.0699946Z 2025-05-07T20:25:32.0699950Z 2025-05-07T20:25:32.0711283Z libcufft-11.3.3.41 | 147.4 MB | ########1 | 81%  2025-05-07T20:25:32.0711632Z 2025-05-07T20:25:32.0711637Z 2025-05-07T20:25:32.0712139Z 2025-05-07T20:25:32.1073220Z libcusolver-11.7.2.5 | 156.9 MB | ########7 | 88%  2025-05-07T20:25:32.1073621Z 2025-05-07T20:25:32.1076296Z 2025-05-07T20:25:32.1271430Z libcusparse-12.5.7.5 | 164.9 MB | ########5 | 86%  2025-05-07T20:25:32.1274237Z 2025-05-07T20:25:32.1303696Z nsight-compute-2025. | 320.6 MB | ####2 | 42%  2025-05-07T20:25:32.1704634Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 29% 2025-05-07T20:25:32.1704904Z 2025-05-07T20:25:32.1704909Z 2025-05-07T20:25:32.1704913Z 2025-05-07T20:25:32.1705801Z 2025-05-07T20:25:32.1930997Z libcufft-11.3.3.41 | 147.4 MB | ########3 | 83%  2025-05-07T20:25:32.1931311Z 2025-05-07T20:25:32.1931315Z 2025-05-07T20:25:32.1931319Z 2025-05-07T20:25:32.2078008Z libcusolver-11.7.2.5 | 156.9 MB | ########9 | 90%  2025-05-07T20:25:32.2078466Z 2025-05-07T20:25:32.2079106Z 2025-05-07T20:25:32.2275357Z libcusparse-12.5.7.5 | 164.9 MB | ########7 | 88%  2025-05-07T20:25:32.2275687Z 2025-05-07T20:25:32.2307757Z nsight-compute-2025. | 320.6 MB | ####3 | 43%  2025-05-07T20:25:32.2709147Z libcublas-12.8.3.14 | 460.2 MB | ### | 30% 2025-05-07T20:25:32.2709474Z 2025-05-07T20:25:32.2709478Z 2025-05-07T20:25:32.2709482Z 2025-05-07T20:25:32.2709920Z 2025-05-07T20:25:32.2952750Z libcufft-11.3.3.41 | 147.4 MB | ########5 | 86%  2025-05-07T20:25:32.2953072Z 2025-05-07T20:25:32.2953078Z 2025-05-07T20:25:32.2955368Z 2025-05-07T20:25:32.3078014Z libcusolver-11.7.2.5 | 156.9 MB | #########1 | 92%  2025-05-07T20:25:32.3078319Z 2025-05-07T20:25:32.3078323Z 2025-05-07T20:25:32.3284281Z libcusparse-12.5.7.5 | 164.9 MB | ######### | 90%  2025-05-07T20:25:32.3284598Z 2025-05-07T20:25:32.3310977Z nsight-compute-2025. | 320.6 MB | ####4 | 44%  2025-05-07T20:25:32.3952558Z libcublas-12.8.3.14 | 460.2 MB | ### | 31% 2025-05-07T20:25:32.3952863Z 2025-05-07T20:25:32.3952869Z 2025-05-07T20:25:32.3954808Z 2025-05-07T20:25:32.3971741Z libcusolver-11.7.2.5 | 156.9 MB | #########3 | 94%  2025-05-07T20:25:32.3972046Z 2025-05-07T20:25:32.3972051Z 2025-05-07T20:25:32.3972055Z 2025-05-07T20:25:32.3972059Z 2025-05-07T20:25:32.4150626Z libcufft-11.3.3.41 | 147.4 MB | ########7 | 88%  2025-05-07T20:25:32.4151209Z 2025-05-07T20:25:32.4152995Z 2025-05-07T20:25:32.4293588Z libcusparse-12.5.7.5 | 164.9 MB | #########2 | 92%  2025-05-07T20:25:32.4293971Z 2025-05-07T20:25:32.4312206Z nsight-compute-2025. 
| 320.6 MB | ####5 | 45%  2025-05-07T20:25:32.4979753Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 32% 2025-05-07T20:25:32.4980091Z 2025-05-07T20:25:32.4980097Z 2025-05-07T20:25:32.4982671Z 2025-05-07T20:25:32.4991964Z libcusolver-11.7.2.5 | 156.9 MB | #########5 | 96%  2025-05-07T20:25:32.4992294Z 2025-05-07T20:25:32.4992299Z 2025-05-07T20:25:32.4992303Z 2025-05-07T20:25:32.4992307Z 2025-05-07T20:25:32.5237910Z libcufft-11.3.3.41 | 147.4 MB | ########9 | 90%  2025-05-07T20:25:32.5238204Z 2025-05-07T20:25:32.5238209Z 2025-05-07T20:25:32.5298165Z libcusparse-12.5.7.5 | 164.9 MB | #########4 | 94%  2025-05-07T20:25:32.5298547Z 2025-05-07T20:25:32.5342731Z nsight-compute-2025. | 320.6 MB | ####6 | 46%  2025-05-07T20:25:32.5993704Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 33% 2025-05-07T20:25:32.5993979Z 2025-05-07T20:25:32.5993992Z 2025-05-07T20:25:32.5993996Z 2025-05-07T20:25:32.5995392Z 2025-05-07T20:25:32.6107799Z libcufft-11.3.3.41 | 147.4 MB | #########1 | 92%  2025-05-07T20:25:32.6108099Z 2025-05-07T20:25:32.6108109Z 2025-05-07T20:25:32.6108113Z 2025-05-07T20:25:32.6373644Z libcusolver-11.7.2.5 | 156.9 MB | #########7 | 97%  2025-05-07T20:25:32.6374073Z 2025-05-07T20:25:32.6374702Z 2025-05-07T20:25:32.6427753Z libcusparse-12.5.7.5 | 164.9 MB | #########6 | 96%  2025-05-07T20:25:32.6428125Z 2025-05-07T20:25:32.6436234Z nsight-compute-2025. | 320.6 MB | ####7 | 47%  2025-05-07T20:25:32.6994713Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 33% 2025-05-07T20:25:32.6995094Z 2025-05-07T20:25:32.6995101Z 2025-05-07T20:25:32.6995108Z 2025-05-07T20:25:32.6995113Z 2025-05-07T20:25:32.7114014Z libcufft-11.3.3.41 | 147.4 MB | #########4 | 94%  2025-05-07T20:25:32.7114348Z 2025-05-07T20:25:32.7114352Z 2025-05-07T20:25:32.7114356Z 2025-05-07T20:25:32.7399045Z libcusolver-11.7.2.5 | 156.9 MB | #########9 | 99%  2025-05-07T20:25:32.7399508Z 2025-05-07T20:25:32.7400113Z 2025-05-07T20:25:32.7438405Z libcusparse-12.5.7.5 | 164.9 MB | #########8 | 98%  2025-05-07T20:25:32.7459296Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 34% 2025-05-07T20:25:32.7461009Z 2025-05-07T20:25:32.7996999Z nsight-compute-2025. | 320.6 MB | ####8 | 48%  2025-05-07T20:25:32.7997393Z 2025-05-07T20:25:32.7997400Z 2025-05-07T20:25:32.7997406Z 2025-05-07T20:25:32.7997411Z 2025-05-07T20:25:32.8461802Z libcufft-11.3.3.41 | 147.4 MB | #########6 | 97%  2025-05-07T20:25:32.8463060Z 2025-05-07T20:25:32.8490938Z nsight-compute-2025. | 320.6 MB | ####9 | 50%  2025-05-07T20:25:32.8998084Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 35% 2025-05-07T20:25:32.8998373Z 2025-05-07T20:25:32.8998377Z 2025-05-07T20:25:32.8998411Z 2025-05-07T20:25:32.8998424Z 2025-05-07T20:25:32.9464274Z libcufft-11.3.3.41 | 147.4 MB | #########9 | 99%  2025-05-07T20:25:32.9465895Z 2025-05-07T20:25:32.9492917Z nsight-compute-2025. | 320.6 MB | ##### | 51%  2025-05-07T20:25:33.0465621Z libcublas-12.8.3.14 | 460.2 MB | ###5 | 36% 2025-05-07T20:25:33.0468469Z 2025-05-07T20:25:33.0495395Z nsight-compute-2025. | 320.6 MB | #####2 | 52%  2025-05-07T20:25:33.1467767Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 37% 2025-05-07T20:25:33.1468192Z 2025-05-07T20:25:33.1892283Z nsight-compute-2025. | 320.6 MB | #####4 | 54%  2025-05-07T20:25:33.2468382Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 38% 2025-05-07T20:25:33.2468645Z 2025-05-07T20:25:33.2950043Z nsight-compute-2025. | 320.6 MB | #####5 | 56%  2025-05-07T20:25:33.3597718Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 39% 2025-05-07T20:25:33.3598000Z 2025-05-07T20:25:33.3959970Z nsight-compute-2025. 
| 320.6 MB | #####7 | 57%  2025-05-07T20:25:33.4714139Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 40% 2025-05-07T20:25:33.4714901Z 2025-05-07T20:25:33.4961642Z nsight-compute-2025. | 320.6 MB | #####8 | 58%  2025-05-07T20:25:33.5756629Z libcublas-12.8.3.14 | 460.2 MB | #### | 41% 2025-05-07T20:25:33.5757539Z 2025-05-07T20:25:33.5965073Z nsight-compute-2025. | 320.6 MB | #####9 | 60%  2025-05-07T20:25:33.6850921Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 41% 2025-05-07T20:25:33.6851412Z 2025-05-07T20:25:33.6965752Z nsight-compute-2025. | 320.6 MB | ######1 | 61%  2025-05-07T20:25:33.7852263Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 42% 2025-05-07T20:25:33.7853665Z 2025-05-07T20:25:33.8027190Z nsight-compute-2025. | 320.6 MB | ######2 | 63%  2025-05-07T20:25:33.8853327Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 43% 2025-05-07T20:25:33.8855002Z 2025-05-07T20:25:33.9222880Z nsight-compute-2025. | 320.6 MB | ######3 | 64%  2025-05-07T20:25:33.9857017Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 44% 2025-05-07T20:25:33.9858673Z 2025-05-07T20:25:34.0223453Z nsight-compute-2025. | 320.6 MB | ######5 | 66%  2025-05-07T20:25:34.0893119Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 45% 2025-05-07T20:25:34.0893499Z 2025-05-07T20:25:34.1226534Z nsight-compute-2025. | 320.6 MB | ######7 | 67%  2025-05-07T20:25:34.1926831Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:25:34.1927170Z 2025-05-07T20:25:34.2434860Z nsight-compute-2025. | 320.6 MB | ######8 | 69%  2025-05-07T20:25:34.2928275Z libcublas-12.8.3.14 | 460.2 MB | ####6 | 47% 2025-05-07T20:25:34.2928548Z 2025-05-07T20:25:34.3436778Z nsight-compute-2025. | 320.6 MB | ####### | 70%  2025-05-07T20:25:34.3987148Z libcublas-12.8.3.14 | 460.2 MB | ####7 | 48% 2025-05-07T20:25:34.3987425Z 2025-05-07T20:25:34.4661927Z nsight-compute-2025. | 320.6 MB | #######2 | 72%  2025-05-07T20:25:34.4987467Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 49% 2025-05-07T20:25:34.4989454Z 2025-05-07T20:25:34.5663356Z nsight-compute-2025. | 320.6 MB | #######3 | 74%  2025-05-07T20:25:34.5992050Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 50% 2025-05-07T20:25:34.5992357Z 2025-05-07T20:25:34.6663582Z nsight-compute-2025. | 320.6 MB | #######5 | 75%  2025-05-07T20:25:34.7014443Z libcublas-12.8.3.14 | 460.2 MB | ##### | 51% 2025-05-07T20:25:34.7014710Z 2025-05-07T20:25:34.7665544Z nsight-compute-2025. | 320.6 MB | #######6 | 77%  2025-05-07T20:25:34.8133162Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 52% 2025-05-07T20:25:34.8133821Z 2025-05-07T20:25:34.8669551Z nsight-compute-2025. | 320.6 MB | #######8 | 79%  2025-05-07T20:25:34.9228517Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 53% 2025-05-07T20:25:34.9228855Z 2025-05-07T20:25:34.9670342Z nsight-compute-2025. | 320.6 MB | ######## | 80%  2025-05-07T20:25:35.0228955Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 54% 2025-05-07T20:25:35.0229260Z 2025-05-07T20:25:35.1247406Z nsight-compute-2025. | 320.6 MB | ########1 | 82%  2025-05-07T20:25:35.1248329Z 2025-05-07T20:25:35.1304804Z nsight-compute-2025. | 320.6 MB | ########3 | 83%  2025-05-07T20:25:35.2247226Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 55% 2025-05-07T20:25:35.2247509Z 2025-05-07T20:25:35.2336123Z nsight-compute-2025. | 320.6 MB | ########5 | 85%  2025-05-07T20:25:35.3312811Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 56% 2025-05-07T20:25:35.3313119Z 2025-05-07T20:25:35.3336672Z nsight-compute-2025. 
| 320.6 MB | ########6 | 87%  2025-05-07T20:25:35.4328714Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 57% 2025-05-07T20:25:35.4328996Z 2025-05-07T20:25:35.4346172Z nsight-compute-2025. | 320.6 MB | ########8 | 88%  2025-05-07T20:25:35.4821393Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 58% 2025-05-07T20:25:35.4821759Z 2025-05-07T20:25:35.4821764Z 2025-05-07T20:25:35.4821768Z 2025-05-07T20:25:35.5085215Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:25:35.5085535Z 2025-05-07T20:25:35.5085539Z 2025-05-07T20:25:35.5085543Z 2025-05-07T20:25:35.5089080Z 2025-05-07T20:25:35.5385882Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:25:35.5393942Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 59% 2025-05-07T20:25:35.5394247Z 2025-05-07T20:25:35.5394253Z 2025-05-07T20:25:35.5394258Z 2025-05-07T20:25:35.5394263Z 2025-05-07T20:25:35.5402504Z 2025-05-07T20:25:35.5697984Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:25:35.5698369Z 2025-05-07T20:25:35.5739326Z nsight-compute-2025. | 320.6 MB | ######### | 90%  2025-05-07T20:25:35.5739684Z 2025-05-07T20:25:35.5739689Z 2025-05-07T20:25:35.5739693Z 2025-05-07T20:25:35.5739696Z 2025-05-07T20:25:35.5739700Z 2025-05-07T20:25:35.5740796Z 2025-05-07T20:25:35.6398314Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:25:35.6399128Z 2025-05-07T20:25:35.6399134Z 2025-05-07T20:25:35.6399138Z 2025-05-07T20:25:35.6399142Z 2025-05-07T20:25:35.6399156Z 2025-05-07T20:25:35.6741379Z libnpp-12.3.3.65 | 130.6 MB | 2 | 2%  2025-05-07T20:25:35.6741821Z 2025-05-07T20:25:35.6741827Z 2025-05-07T20:25:35.6741833Z 2025-05-07T20:25:35.6741839Z 2025-05-07T20:25:35.6741858Z 2025-05-07T20:25:35.6742709Z 2025-05-07T20:25:35.6862316Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 3%  2025-05-07T20:25:35.7371269Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 60% 2025-05-07T20:25:35.7372301Z 2025-05-07T20:25:35.7406079Z nsight-compute-2025. | 320.6 MB | #########1 | 92%  2025-05-07T20:25:35.7406366Z 2025-05-07T20:25:35.7406371Z 2025-05-07T20:25:35.7406374Z 2025-05-07T20:25:35.7406378Z 2025-05-07T20:25:35.7408944Z 2025-05-07T20:25:35.7743696Z libnpp-12.3.3.65 | 130.6 MB | 4 | 5%  2025-05-07T20:25:35.7744023Z 2025-05-07T20:25:35.7744064Z 2025-05-07T20:25:35.7744086Z 2025-05-07T20:25:35.7744092Z 2025-05-07T20:25:35.7744097Z 2025-05-07T20:25:35.7745691Z 2025-05-07T20:25:35.8249008Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 5%  2025-05-07T20:25:35.8421112Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:25:35.8421480Z 2025-05-07T20:25:35.8421635Z 2025-05-07T20:25:35.8421642Z 2025-05-07T20:25:35.8421647Z 2025-05-07T20:25:35.8421990Z 2025-05-07T20:25:35.8747675Z libnpp-12.3.3.65 | 130.6 MB | 6 | 7%  2025-05-07T20:25:35.8748095Z 2025-05-07T20:25:35.8748101Z 2025-05-07T20:25:35.8748106Z 2025-05-07T20:25:35.8748111Z 2025-05-07T20:25:35.8748116Z 2025-05-07T20:25:35.8750717Z 2025-05-07T20:25:35.8774323Z cuda-nsight-12.8.55 | 113.2 MB | 7 | 8%  2025-05-07T20:25:35.8780814Z 2025-05-07T20:25:35.9043479Z nsight-compute-2025. 
| 320.6 MB | #########2 | 93%  2025-05-07T20:25:35.9043857Z 2025-05-07T20:25:35.9043900Z 2025-05-07T20:25:35.9425070Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:25:35.9425474Z 2025-05-07T20:25:35.9425480Z 2025-05-07T20:25:35.9425485Z 2025-05-07T20:25:35.9425490Z 2025-05-07T20:25:35.9425504Z 2025-05-07T20:25:35.9551370Z libnpp-12.3.3.65 | 130.6 MB | 8 | 9%  2025-05-07T20:25:35.9551787Z 2025-05-07T20:25:35.9551794Z 2025-05-07T20:25:35.9551799Z 2025-05-07T20:25:35.9551804Z 2025-05-07T20:25:35.9551824Z 2025-05-07T20:25:35.9551830Z 2025-05-07T20:25:35.9561669Z 2025-05-07T20:25:35.9633405Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:25:36.0197807Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 61% 2025-05-07T20:25:36.0198141Z 2025-05-07T20:25:36.0198148Z 2025-05-07T20:25:36.0198165Z 2025-05-07T20:25:36.0198170Z 2025-05-07T20:25:36.0198175Z 2025-05-07T20:25:36.0198180Z 2025-05-07T20:25:36.0366176Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:25:36.0372403Z 2025-05-07T20:25:36.0436663Z nsight-compute-2025. | 320.6 MB | #########4 | 94%  2025-05-07T20:25:36.0436948Z 2025-05-07T20:25:36.0436952Z 2025-05-07T20:25:36.0436956Z 2025-05-07T20:25:36.0436959Z 2025-05-07T20:25:36.0438424Z 2025-05-07T20:25:36.0553804Z libnpp-12.3.3.65 | 130.6 MB | # | 11%  2025-05-07T20:25:36.0554199Z 2025-05-07T20:25:36.0554212Z 2025-05-07T20:25:36.0554216Z 2025-05-07T20:25:36.0554220Z 2025-05-07T20:25:36.0554223Z 2025-05-07T20:25:36.0554227Z 2025-05-07T20:25:36.0557947Z 2025-05-07T20:25:36.0839996Z cuda-nvvp-12.8.57 | 112.4 MB | 2 | 2%  2025-05-07T20:25:36.1199371Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 62% 2025-05-07T20:25:36.1199661Z 2025-05-07T20:25:36.1199665Z 2025-05-07T20:25:36.1199669Z 2025-05-07T20:25:36.1199682Z 2025-05-07T20:25:36.1199686Z 2025-05-07T20:25:36.1205454Z 2025-05-07T20:25:36.1537505Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 12%  2025-05-07T20:25:36.1538338Z 2025-05-07T20:25:36.1538346Z 2025-05-07T20:25:36.1538362Z 2025-05-07T20:25:36.1538368Z 2025-05-07T20:25:36.1540932Z 2025-05-07T20:25:36.1560610Z libnpp-12.3.3.65 | 130.6 MB | #2 | 13%  2025-05-07T20:25:36.1561009Z 2025-05-07T20:25:36.1561025Z 2025-05-07T20:25:36.1561030Z 2025-05-07T20:25:36.1561035Z 2025-05-07T20:25:36.1561040Z 2025-05-07T20:25:36.1561045Z 2025-05-07T20:25:36.1571111Z 2025-05-07T20:25:36.1852202Z cuda-nvvp-12.8.57 | 112.4 MB | 4 | 4%  2025-05-07T20:25:36.1854020Z 2025-05-07T20:25:36.2182158Z nsight-compute-2025. | 320.6 MB | #########5 | 95%  2025-05-07T20:25:36.2202366Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:25:36.2202694Z 2025-05-07T20:25:36.2202699Z 2025-05-07T20:25:36.2202703Z 2025-05-07T20:25:36.2202707Z 2025-05-07T20:25:36.2202711Z 2025-05-07T20:25:36.2202715Z 2025-05-07T20:25:36.2564365Z cuda-nsight-12.8.55 | 113.2 MB | #4 | 14%  2025-05-07T20:25:36.2564842Z 2025-05-07T20:25:36.2564848Z 2025-05-07T20:25:36.2564853Z 2025-05-07T20:25:36.2564859Z 2025-05-07T20:25:36.2564864Z 2025-05-07T20:25:36.2564869Z 2025-05-07T20:25:36.2565002Z 2025-05-07T20:25:36.2595828Z cuda-nvvp-12.8.57 | 112.4 MB | 6 | 7%  2025-05-07T20:25:36.2596164Z 2025-05-07T20:25:36.2596170Z 2025-05-07T20:25:36.2596181Z 2025-05-07T20:25:36.2596186Z 2025-05-07T20:25:36.2596191Z 2025-05-07T20:25:36.3066463Z libnpp-12.3.3.65 | 130.6 MB | #4 | 15%  2025-05-07T20:25:36.3067613Z 2025-05-07T20:25:36.3228789Z nsight-compute-2025. 
| 320.6 MB | #########6 | 96%  2025-05-07T20:25:36.3229077Z 2025-05-07T20:25:36.3229083Z 2025-05-07T20:25:36.3229088Z 2025-05-07T20:25:36.3229093Z 2025-05-07T20:25:36.3229098Z 2025-05-07T20:25:36.3229101Z 2025-05-07T20:25:36.3388254Z cuda-nsight-12.8.55 | 113.2 MB | #6 | 16%  2025-05-07T20:25:36.3565188Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 63% 2025-05-07T20:25:36.3565577Z 2025-05-07T20:25:36.3565583Z 2025-05-07T20:25:36.3565589Z 2025-05-07T20:25:36.3565594Z 2025-05-07T20:25:36.3565599Z 2025-05-07T20:25:36.3565603Z 2025-05-07T20:25:36.3567881Z 2025-05-07T20:25:36.3609182Z cuda-nvvp-12.8.57 | 112.4 MB | 8 | 9%  2025-05-07T20:25:36.3609591Z 2025-05-07T20:25:36.3609596Z 2025-05-07T20:25:36.3609600Z 2025-05-07T20:25:36.3609605Z 2025-05-07T20:25:36.3609609Z 2025-05-07T20:25:36.4249348Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:25:36.4249674Z 2025-05-07T20:25:36.4249679Z 2025-05-07T20:25:36.4249682Z 2025-05-07T20:25:36.4249686Z 2025-05-07T20:25:36.4249690Z 2025-05-07T20:25:36.4251299Z 2025-05-07T20:25:36.4273190Z cuda-nsight-12.8.55 | 113.2 MB | #8 | 18%  2025-05-07T20:25:36.4273626Z 2025-05-07T20:25:36.4566516Z nsight-compute-2025. | 320.6 MB | #########6 | 97%  2025-05-07T20:25:36.4566846Z 2025-05-07T20:25:36.4567079Z 2025-05-07T20:25:36.4567093Z 2025-05-07T20:25:36.4567097Z 2025-05-07T20:25:36.4567101Z 2025-05-07T20:25:36.4567105Z 2025-05-07T20:25:36.4567894Z 2025-05-07T20:25:36.4607805Z cuda-nvvp-12.8.57 | 112.4 MB | # | 11%  2025-05-07T20:25:36.4612135Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 64% 2025-05-07T20:25:36.4612486Z 2025-05-07T20:25:36.4612493Z 2025-05-07T20:25:36.4612498Z 2025-05-07T20:25:36.4612503Z 2025-05-07T20:25:36.4612509Z 2025-05-07T20:25:36.5250433Z libnpp-12.3.3.65 | 130.6 MB | #8 | 19%  2025-05-07T20:25:36.5250747Z 2025-05-07T20:25:36.5250751Z 2025-05-07T20:25:36.5250755Z 2025-05-07T20:25:36.5250759Z 2025-05-07T20:25:36.5250763Z 2025-05-07T20:25:36.5251409Z 2025-05-07T20:25:36.5431283Z cuda-nsight-12.8.55 | 113.2 MB | ## | 20%  2025-05-07T20:25:36.5436313Z 2025-05-07T20:25:36.5566314Z nsight-compute-2025. | 320.6 MB | #########7 | 98%  2025-05-07T20:25:36.5566950Z 2025-05-07T20:25:36.5566957Z 2025-05-07T20:25:36.5566960Z 2025-05-07T20:25:36.5566964Z 2025-05-07T20:25:36.5566968Z 2025-05-07T20:25:36.5566971Z 2025-05-07T20:25:36.5567641Z 2025-05-07T20:25:36.5654135Z cuda-nvvp-12.8.57 | 112.4 MB | #2 | 13%  2025-05-07T20:25:36.5689318Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:25:36.5689676Z 2025-05-07T20:25:36.5689683Z 2025-05-07T20:25:36.5689688Z 2025-05-07T20:25:36.5689693Z 2025-05-07T20:25:36.5689699Z 2025-05-07T20:25:36.6250415Z libnpp-12.3.3.65 | 130.6 MB | ## | 21%  2025-05-07T20:25:36.6250729Z 2025-05-07T20:25:36.6250733Z 2025-05-07T20:25:36.6250738Z 2025-05-07T20:25:36.6250742Z 2025-05-07T20:25:36.6250747Z 2025-05-07T20:25:36.6250765Z 2025-05-07T20:25:36.6645097Z cuda-nsight-12.8.55 | 113.2 MB | ##2 | 22%  2025-05-07T20:25:36.6645425Z 2025-05-07T20:25:36.6645429Z 2025-05-07T20:25:36.6645433Z 2025-05-07T20:25:36.6645477Z 2025-05-07T20:25:36.6645494Z 2025-05-07T20:25:36.6645498Z 2025-05-07T20:25:36.6651242Z 2025-05-07T20:25:36.6670798Z cuda-nvvp-12.8.57 | 112.4 MB | #4 | 15%  2025-05-07T20:25:36.6671092Z 2025-05-07T20:25:36.6758398Z nsight-compute-2025. 
| 320.6 MB | #########8 | 99%  2025-05-07T20:25:36.6828610Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 65% 2025-05-07T20:25:36.6828862Z 2025-05-07T20:25:36.6828866Z 2025-05-07T20:25:36.6828870Z 2025-05-07T20:25:36.6828874Z 2025-05-07T20:25:36.6831384Z 2025-05-07T20:25:36.7263560Z libnpp-12.3.3.65 | 130.6 MB | ##2 | 23%  2025-05-07T20:25:36.7263989Z 2025-05-07T20:25:36.7263994Z 2025-05-07T20:25:36.7263999Z 2025-05-07T20:25:36.7264005Z 2025-05-07T20:25:36.7264011Z 2025-05-07T20:25:36.7267807Z 2025-05-07T20:25:36.7651041Z cuda-nsight-12.8.55 | 113.2 MB | ##4 | 24%  2025-05-07T20:25:36.7651381Z 2025-05-07T20:25:36.7651385Z 2025-05-07T20:25:36.7651421Z 2025-05-07T20:25:36.7651435Z 2025-05-07T20:25:36.7651439Z 2025-05-07T20:25:36.7651444Z 2025-05-07T20:25:36.7659552Z 2025-05-07T20:25:36.7749434Z cuda-nvvp-12.8.57 | 112.4 MB | #6 | 17%  2025-05-07T20:25:36.7750017Z 2025-05-07T20:25:36.7840175Z nsight-compute-2025. | 320.6 MB | #########9 | 99%  2025-05-07T20:25:36.7860172Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 66% 2025-05-07T20:25:36.7860542Z 2025-05-07T20:25:36.7860549Z 2025-05-07T20:25:36.7860554Z 2025-05-07T20:25:36.7860559Z 2025-05-07T20:25:36.7862353Z 2025-05-07T20:25:36.8265364Z libnpp-12.3.3.65 | 130.6 MB | ##4 | 24%  2025-05-07T20:25:36.8265761Z 2025-05-07T20:25:36.8265766Z 2025-05-07T20:25:36.8265772Z 2025-05-07T20:25:36.8265778Z 2025-05-07T20:25:36.8265783Z 2025-05-07T20:25:36.8265788Z 2025-05-07T20:25:36.8669415Z cuda-nsight-12.8.55 | 113.2 MB | ##6 | 27%  2025-05-07T20:25:36.8669952Z 2025-05-07T20:25:36.8669987Z 2025-05-07T20:25:36.8670261Z 2025-05-07T20:25:36.8670268Z 2025-05-07T20:25:36.8670273Z 2025-05-07T20:25:36.8670278Z 2025-05-07T20:25:36.8670698Z 2025-05-07T20:25:36.8868574Z cuda-nvvp-12.8.57 | 112.4 MB | #8 | 19%  2025-05-07T20:25:36.8868993Z 2025-05-07T20:25:36.8869000Z 2025-05-07T20:25:36.8869005Z 2025-05-07T20:25:36.8869010Z 2025-05-07T20:25:36.8869862Z 2025-05-07T20:25:36.9265958Z libnpp-12.3.3.65 | 130.6 MB | ##6 | 26%  2025-05-07T20:25:36.9266365Z 2025-05-07T20:25:36.9266370Z 2025-05-07T20:25:36.9266376Z 2025-05-07T20:25:36.9266381Z 2025-05-07T20:25:36.9266386Z 2025-05-07T20:25:36.9269048Z 2025-05-07T20:25:36.9709870Z cuda-nsight-12.8.55 | 113.2 MB | ##9 | 29%  2025-05-07T20:25:36.9870257Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:25:36.9870630Z 2025-05-07T20:25:36.9870636Z 2025-05-07T20:25:36.9870641Z 2025-05-07T20:25:36.9870646Z 2025-05-07T20:25:36.9874678Z 2025-05-07T20:25:37.0266567Z libnpp-12.3.3.65 | 130.6 MB | ##8 | 29%  2025-05-07T20:25:37.0266982Z 2025-05-07T20:25:37.0266988Z 2025-05-07T20:25:37.0266993Z 2025-05-07T20:25:37.0266998Z 2025-05-07T20:25:37.0267004Z 2025-05-07T20:25:37.0275968Z 2025-05-07T20:25:37.0284663Z cuda-nsight-12.8.55 | 113.2 MB | ###2 | 32%  2025-05-07T20:25:37.0285088Z 2025-05-07T20:25:37.0285094Z 2025-05-07T20:25:37.0285099Z 2025-05-07T20:25:37.0285104Z 2025-05-07T20:25:37.0285109Z 2025-05-07T20:25:37.0285114Z 2025-05-07T20:25:37.0285119Z 2025-05-07T20:25:37.0711891Z cuda-nvvp-12.8.57 | 112.4 MB | ## | 21%  2025-05-07T20:25:37.0965440Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:25:37.0965817Z 2025-05-07T20:25:37.0965823Z 2025-05-07T20:25:37.0965828Z 2025-05-07T20:25:37.0965833Z 2025-05-07T20:25:37.0965838Z 2025-05-07T20:25:37.1287260Z libnpp-12.3.3.65 | 130.6 MB | ### | 31%  2025-05-07T20:25:37.1287606Z 2025-05-07T20:25:37.1287611Z 2025-05-07T20:25:37.1287617Z 2025-05-07T20:25:37.1287622Z 2025-05-07T20:25:37.1287627Z 2025-05-07T20:25:37.1287632Z 2025-05-07T20:25:37.1287644Z 
2025-05-07T20:25:37.1298490Z cuda-nvvp-12.8.57 | 112.4 MB | ##2 | 23%  2025-05-07T20:25:37.1298861Z 2025-05-07T20:25:37.1298867Z 2025-05-07T20:25:37.1298872Z 2025-05-07T20:25:37.1298877Z 2025-05-07T20:25:37.1298894Z 2025-05-07T20:25:37.1298900Z 2025-05-07T20:25:37.1721210Z cuda-nsight-12.8.55 | 113.2 MB | ###4 | 34%  2025-05-07T20:25:37.1985124Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:25:37.1985405Z 2025-05-07T20:25:37.1985411Z 2025-05-07T20:25:37.1985415Z 2025-05-07T20:25:37.1985418Z 2025-05-07T20:25:37.1985422Z 2025-05-07T20:25:37.2288428Z libnpp-12.3.3.65 | 130.6 MB | ###2 | 33%  2025-05-07T20:25:37.2288737Z 2025-05-07T20:25:37.2288742Z 2025-05-07T20:25:37.2288745Z 2025-05-07T20:25:37.2288773Z 2025-05-07T20:25:37.2288787Z 2025-05-07T20:25:37.2288791Z 2025-05-07T20:25:37.2289911Z 2025-05-07T20:25:37.2379062Z cuda-nvvp-12.8.57 | 112.4 MB | ##4 | 25%  2025-05-07T20:25:37.2379370Z 2025-05-07T20:25:37.2379375Z 2025-05-07T20:25:37.2379379Z 2025-05-07T20:25:37.2379382Z 2025-05-07T20:25:37.2379386Z 2025-05-07T20:25:37.2379400Z 2025-05-07T20:25:37.2727257Z cuda-nsight-12.8.55 | 113.2 MB | ###6 | 37%  2025-05-07T20:25:37.3039999Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 68% 2025-05-07T20:25:37.3040301Z 2025-05-07T20:25:37.3040306Z 2025-05-07T20:25:37.3040310Z 2025-05-07T20:25:37.3040313Z 2025-05-07T20:25:37.3041442Z 2025-05-07T20:25:37.3291701Z libnpp-12.3.3.65 | 130.6 MB | ###4 | 34%  2025-05-07T20:25:37.3292007Z 2025-05-07T20:25:37.3292012Z 2025-05-07T20:25:37.3292015Z 2025-05-07T20:25:37.3292019Z 2025-05-07T20:25:37.3292023Z 2025-05-07T20:25:37.3292026Z 2025-05-07T20:25:37.3294590Z 2025-05-07T20:25:37.3442878Z cuda-nvvp-12.8.57 | 112.4 MB | ##7 | 27%  2025-05-07T20:25:37.3443251Z 2025-05-07T20:25:37.3443257Z 2025-05-07T20:25:37.3443262Z 2025-05-07T20:25:37.3443266Z 2025-05-07T20:25:37.3443271Z 2025-05-07T20:25:37.3450000Z 2025-05-07T20:25:37.3727549Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 39%  2025-05-07T20:25:37.4040754Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 68% 2025-05-07T20:25:37.4041044Z 2025-05-07T20:25:37.4041049Z 2025-05-07T20:25:37.4041053Z 2025-05-07T20:25:37.4041057Z 2025-05-07T20:25:37.4046187Z 2025-05-07T20:25:37.4296701Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 36%  2025-05-07T20:25:37.4297033Z 2025-05-07T20:25:37.4297038Z 2025-05-07T20:25:37.4297042Z 2025-05-07T20:25:37.4297046Z 2025-05-07T20:25:37.4297058Z 2025-05-07T20:25:37.4297063Z 2025-05-07T20:25:37.4302635Z 2025-05-07T20:25:37.4450640Z cuda-nvvp-12.8.57 | 112.4 MB | ##9 | 30%  2025-05-07T20:25:37.4451346Z 2025-05-07T20:25:37.4451363Z 2025-05-07T20:25:37.4451369Z 2025-05-07T20:25:37.4451374Z 2025-05-07T20:25:37.4451379Z 2025-05-07T20:25:37.4451384Z 2025-05-07T20:25:37.4735729Z cuda-nsight-12.8.55 | 113.2 MB | ####1 | 42%  2025-05-07T20:25:37.5061597Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 69% 2025-05-07T20:25:37.5062002Z 2025-05-07T20:25:37.5062008Z 2025-05-07T20:25:37.5062014Z 2025-05-07T20:25:37.5062020Z 2025-05-07T20:25:37.5063526Z 2025-05-07T20:25:37.5338446Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:25:37.5338756Z 2025-05-07T20:25:37.5338760Z 2025-05-07T20:25:37.5338764Z 2025-05-07T20:25:37.5338768Z 2025-05-07T20:25:37.5338772Z 2025-05-07T20:25:37.5338785Z 2025-05-07T20:25:37.5343853Z 2025-05-07T20:25:37.5459467Z cuda-nvvp-12.8.57 | 112.4 MB | ###1 | 32%  2025-05-07T20:25:37.5459773Z 2025-05-07T20:25:37.5459787Z 2025-05-07T20:25:37.5459819Z 2025-05-07T20:25:37.5459833Z 2025-05-07T20:25:37.5459837Z 2025-05-07T20:25:37.5461546Z 2025-05-07T20:25:37.5832363Z cuda-nsight-12.8.55 | 
113.2 MB | ####4 | 44%  2025-05-07T20:25:37.6121420Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 69% 2025-05-07T20:25:37.6121696Z 2025-05-07T20:25:37.6121700Z 2025-05-07T20:25:37.6121704Z 2025-05-07T20:25:37.6121708Z 2025-05-07T20:25:37.6121714Z 2025-05-07T20:25:37.6341493Z libnpp-12.3.3.65 | 130.6 MB | #### | 40%  2025-05-07T20:25:37.6341782Z 2025-05-07T20:25:37.6341786Z 2025-05-07T20:25:37.6341790Z 2025-05-07T20:25:37.6341793Z 2025-05-07T20:25:37.6341800Z 2025-05-07T20:25:37.6341811Z 2025-05-07T20:25:37.6343448Z 2025-05-07T20:25:37.6469207Z cuda-nvvp-12.8.57 | 112.4 MB | ###4 | 34%  2025-05-07T20:25:37.6469562Z 2025-05-07T20:25:37.6469567Z 2025-05-07T20:25:37.6469571Z 2025-05-07T20:25:37.6469575Z 2025-05-07T20:25:37.6469578Z 2025-05-07T20:25:37.6473449Z 2025-05-07T20:25:37.6865155Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 47%  2025-05-07T20:25:37.7181627Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 70% 2025-05-07T20:25:37.7181918Z 2025-05-07T20:25:37.7181922Z 2025-05-07T20:25:37.7181926Z 2025-05-07T20:25:37.7181930Z 2025-05-07T20:25:37.7181934Z 2025-05-07T20:25:37.7375439Z libnpp-12.3.3.65 | 130.6 MB | ####2 | 42%  2025-05-07T20:25:37.7375734Z 2025-05-07T20:25:37.7375738Z 2025-05-07T20:25:37.7375743Z 2025-05-07T20:25:37.7375747Z 2025-05-07T20:25:37.7375751Z 2025-05-07T20:25:37.7375755Z 2025-05-07T20:25:37.7375759Z 2025-05-07T20:25:37.7489238Z cuda-nvvp-12.8.57 | 112.4 MB | ###6 | 36%  2025-05-07T20:25:37.7489664Z 2025-05-07T20:25:37.7489669Z 2025-05-07T20:25:37.7489674Z 2025-05-07T20:25:37.7489680Z 2025-05-07T20:25:37.7489686Z 2025-05-07T20:25:37.7489691Z 2025-05-07T20:25:37.7892564Z cuda-nsight-12.8.55 | 113.2 MB | ####9 | 49%  2025-05-07T20:25:37.8185093Z libcublas-12.8.3.14 | 460.2 MB | ####### | 70% 2025-05-07T20:25:37.8185472Z 2025-05-07T20:25:37.8185478Z 2025-05-07T20:25:37.8185483Z 2025-05-07T20:25:37.8185488Z 2025-05-07T20:25:37.8185493Z 2025-05-07T20:25:37.8382636Z libnpp-12.3.3.65 | 130.6 MB | ####4 | 44%  2025-05-07T20:25:37.8383218Z 2025-05-07T20:25:37.8383223Z 2025-05-07T20:25:37.8383229Z 2025-05-07T20:25:37.8383234Z 2025-05-07T20:25:37.8383239Z 2025-05-07T20:25:37.8383244Z 2025-05-07T20:25:37.8388234Z 2025-05-07T20:25:37.8534478Z cuda-nvvp-12.8.57 | 112.4 MB | ###8 | 39%  2025-05-07T20:25:37.8534898Z 2025-05-07T20:25:37.8534904Z 2025-05-07T20:25:37.8534909Z 2025-05-07T20:25:37.8534915Z 2025-05-07T20:25:37.8534920Z 2025-05-07T20:25:37.8534925Z 2025-05-07T20:25:37.8926385Z cuda-nsight-12.8.55 | 113.2 MB | #####1 | 51%  2025-05-07T20:25:37.9185544Z libcublas-12.8.3.14 | 460.2 MB | ####### | 71% 2025-05-07T20:25:37.9186163Z 2025-05-07T20:25:37.9186184Z 2025-05-07T20:25:37.9186198Z 2025-05-07T20:25:37.9186203Z 2025-05-07T20:25:37.9186213Z 2025-05-07T20:25:37.9383526Z libnpp-12.3.3.65 | 130.6 MB | ####6 | 46%  2025-05-07T20:25:37.9383912Z 2025-05-07T20:25:37.9383917Z 2025-05-07T20:25:37.9383941Z 2025-05-07T20:25:37.9383946Z 2025-05-07T20:25:37.9383951Z 2025-05-07T20:25:37.9383956Z 2025-05-07T20:25:37.9383962Z 2025-05-07T20:25:37.9574396Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 41%  2025-05-07T20:25:37.9574952Z 2025-05-07T20:25:37.9574959Z 2025-05-07T20:25:37.9574964Z 2025-05-07T20:25:37.9574969Z 2025-05-07T20:25:37.9574974Z 2025-05-07T20:25:37.9574980Z 2025-05-07T20:25:37.9926337Z cuda-nsight-12.8.55 | 113.2 MB | #####3 | 54%  2025-05-07T20:25:38.0224613Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 71% 2025-05-07T20:25:38.0224993Z 2025-05-07T20:25:38.0225000Z 2025-05-07T20:25:38.0225005Z 2025-05-07T20:25:38.0225039Z 2025-05-07T20:25:38.0226540Z 2025-05-07T20:25:38.0412037Z 
libnpp-12.3.3.65 | 130.6 MB | ####8 | 48%  2025-05-07T20:25:38.0412866Z 2025-05-07T20:25:38.0412872Z 2025-05-07T20:25:38.0412877Z 2025-05-07T20:25:38.0412891Z 2025-05-07T20:25:38.0412896Z 2025-05-07T20:25:38.0412901Z 2025-05-07T20:25:38.0414022Z 2025-05-07T20:25:38.0579239Z cuda-nvvp-12.8.57 | 112.4 MB | ####3 | 43%  2025-05-07T20:25:38.0579650Z 2025-05-07T20:25:38.0579656Z 2025-05-07T20:25:38.0579661Z 2025-05-07T20:25:38.0579666Z 2025-05-07T20:25:38.0579672Z 2025-05-07T20:25:38.0580814Z 2025-05-07T20:25:38.0927966Z cuda-nsight-12.8.55 | 113.2 MB | #####6 | 56%  2025-05-07T20:25:38.1259428Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 72% 2025-05-07T20:25:38.1259796Z 2025-05-07T20:25:38.1259800Z 2025-05-07T20:25:38.1259805Z 2025-05-07T20:25:38.1259808Z 2025-05-07T20:25:38.1261037Z 2025-05-07T20:25:38.1551556Z libnpp-12.3.3.65 | 130.6 MB | ##### | 50%  2025-05-07T20:25:38.1551998Z 2025-05-07T20:25:38.1552003Z 2025-05-07T20:25:38.1552007Z 2025-05-07T20:25:38.1552010Z 2025-05-07T20:25:38.1552014Z 2025-05-07T20:25:38.1552018Z 2025-05-07T20:25:38.1552022Z 2025-05-07T20:25:38.1608276Z cuda-nvvp-12.8.57 | 112.4 MB | ####5 | 46%  2025-05-07T20:25:38.1608584Z 2025-05-07T20:25:38.1608589Z 2025-05-07T20:25:38.1608592Z 2025-05-07T20:25:38.1608596Z 2025-05-07T20:25:38.1608600Z 2025-05-07T20:25:38.1610950Z 2025-05-07T20:25:38.1934787Z cuda-nsight-12.8.55 | 113.2 MB | #####8 | 59%  2025-05-07T20:25:38.2263177Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 72% 2025-05-07T20:25:38.2263605Z 2025-05-07T20:25:38.2263609Z 2025-05-07T20:25:38.2263613Z 2025-05-07T20:25:38.2263617Z 2025-05-07T20:25:38.2266206Z 2025-05-07T20:25:38.2560247Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 52%  2025-05-07T20:25:38.2560643Z 2025-05-07T20:25:38.2560674Z 2025-05-07T20:25:38.2560940Z 2025-05-07T20:25:38.2560950Z 2025-05-07T20:25:38.2560955Z 2025-05-07T20:25:38.2560959Z 2025-05-07T20:25:38.2564041Z 2025-05-07T20:25:38.2691120Z cuda-nvvp-12.8.57 | 112.4 MB | ####7 | 48%  2025-05-07T20:25:38.2691437Z 2025-05-07T20:25:38.2691441Z 2025-05-07T20:25:38.2691445Z 2025-05-07T20:25:38.2691448Z 2025-05-07T20:25:38.2691452Z 2025-05-07T20:25:38.2691456Z 2025-05-07T20:25:38.2934804Z cuda-nsight-12.8.55 | 113.2 MB | ######1 | 61%  2025-05-07T20:25:38.3271383Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 73% 2025-05-07T20:25:38.3271692Z 2025-05-07T20:25:38.3271696Z 2025-05-07T20:25:38.3271700Z 2025-05-07T20:25:38.3271704Z 2025-05-07T20:25:38.3272894Z 2025-05-07T20:25:38.3564945Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 54%  2025-05-07T20:25:38.3565260Z 2025-05-07T20:25:38.3565264Z 2025-05-07T20:25:38.3565268Z 2025-05-07T20:25:38.3565272Z 2025-05-07T20:25:38.3565276Z 2025-05-07T20:25:38.3565541Z 2025-05-07T20:25:38.3567656Z 2025-05-07T20:25:38.3735375Z cuda-nvvp-12.8.57 | 112.4 MB | ##### | 50%  2025-05-07T20:25:38.3735743Z 2025-05-07T20:25:38.3735749Z 2025-05-07T20:25:38.3735762Z 2025-05-07T20:25:38.3735767Z 2025-05-07T20:25:38.3735772Z 2025-05-07T20:25:38.3737333Z 2025-05-07T20:25:38.4273932Z cuda-nsight-12.8.55 | 113.2 MB | ######3 | 63%  2025-05-07T20:25:38.4274350Z 2025-05-07T20:25:38.4274356Z 2025-05-07T20:25:38.4274361Z 2025-05-07T20:25:38.4274365Z 2025-05-07T20:25:38.4275685Z 2025-05-07T20:25:38.4737613Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 57%  2025-05-07T20:25:38.4737941Z 2025-05-07T20:25:38.4737946Z 2025-05-07T20:25:38.4737950Z 2025-05-07T20:25:38.4737962Z 2025-05-07T20:25:38.4737966Z 2025-05-07T20:25:38.4739187Z 2025-05-07T20:25:38.5060095Z cuda-nsight-12.8.55 | 113.2 MB | ######6 | 66%  2025-05-07T20:25:38.5275816Z libcublas-12.8.3.14 | 
460.2 MB | #######3 | 73% 2025-05-07T20:25:38.5276249Z 2025-05-07T20:25:38.5276254Z 2025-05-07T20:25:38.5276258Z 2025-05-07T20:25:38.5276262Z 2025-05-07T20:25:38.5276265Z 2025-05-07T20:25:38.5400025Z libnpp-12.3.3.65 | 130.6 MB | #####9 | 59%  2025-05-07T20:25:38.5400394Z 2025-05-07T20:25:38.5400398Z 2025-05-07T20:25:38.5400402Z 2025-05-07T20:25:38.5400405Z 2025-05-07T20:25:38.5400409Z 2025-05-07T20:25:38.5400413Z 2025-05-07T20:25:38.5406270Z 2025-05-07T20:25:38.5784278Z cuda-nvvp-12.8.57 | 112.4 MB | #####2 | 53%  2025-05-07T20:25:38.5784639Z 2025-05-07T20:25:38.5784643Z 2025-05-07T20:25:38.5784647Z 2025-05-07T20:25:38.5784651Z 2025-05-07T20:25:38.5784666Z 2025-05-07T20:25:38.5786055Z 2025-05-07T20:25:38.6061582Z cuda-nsight-12.8.55 | 113.2 MB | ######8 | 69%  2025-05-07T20:25:38.6319701Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 74% 2025-05-07T20:25:38.6320051Z 2025-05-07T20:25:38.6320056Z 2025-05-07T20:25:38.6320089Z 2025-05-07T20:25:38.6320105Z 2025-05-07T20:25:38.6321437Z 2025-05-07T20:25:38.6406114Z libnpp-12.3.3.65 | 130.6 MB | ######1 | 61%  2025-05-07T20:25:38.6406512Z 2025-05-07T20:25:38.6406520Z 2025-05-07T20:25:38.6406529Z 2025-05-07T20:25:38.6406538Z 2025-05-07T20:25:38.6406546Z 2025-05-07T20:25:38.6406555Z 2025-05-07T20:25:38.6406563Z 2025-05-07T20:25:38.6784468Z cuda-nvvp-12.8.57 | 112.4 MB | #####5 | 55%  2025-05-07T20:25:38.6784939Z 2025-05-07T20:25:38.6784947Z 2025-05-07T20:25:38.6784953Z 2025-05-07T20:25:38.6784960Z 2025-05-07T20:25:38.6784966Z 2025-05-07T20:25:38.6787405Z 2025-05-07T20:25:38.7061936Z cuda-nsight-12.8.55 | 113.2 MB | #######1 | 71%  2025-05-07T20:25:38.7408834Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:25:38.7409257Z 2025-05-07T20:25:38.7409266Z 2025-05-07T20:25:38.7409273Z 2025-05-07T20:25:38.7409278Z 2025-05-07T20:25:38.7409283Z 2025-05-07T20:25:38.7409321Z 2025-05-07T20:25:38.7409606Z 2025-05-07T20:25:38.7786922Z cuda-nvvp-12.8.57 | 112.4 MB | #####7 | 57%  2025-05-07T20:25:38.7787380Z 2025-05-07T20:25:38.7787384Z 2025-05-07T20:25:38.7787388Z 2025-05-07T20:25:38.7787392Z 2025-05-07T20:25:38.7787395Z 2025-05-07T20:25:38.7787399Z 2025-05-07T20:25:38.8062973Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 74%  2025-05-07T20:25:38.8305624Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 75% 2025-05-07T20:25:38.8305916Z 2025-05-07T20:25:38.8305920Z 2025-05-07T20:25:38.8305924Z 2025-05-07T20:25:38.8305928Z 2025-05-07T20:25:38.8305933Z 2025-05-07T20:25:38.8410034Z libnpp-12.3.3.65 | 130.6 MB | ######3 | 63%  2025-05-07T20:25:38.8410496Z 2025-05-07T20:25:38.8410503Z 2025-05-07T20:25:38.8410508Z 2025-05-07T20:25:38.8410513Z 2025-05-07T20:25:38.8410518Z 2025-05-07T20:25:38.8410523Z 2025-05-07T20:25:38.8412381Z 2025-05-07T20:25:38.8850667Z cuda-nvvp-12.8.57 | 112.4 MB | ###### | 60%  2025-05-07T20:25:38.8851354Z 2025-05-07T20:25:38.8851359Z 2025-05-07T20:25:38.8851362Z 2025-05-07T20:25:38.8851366Z 2025-05-07T20:25:38.8851370Z 2025-05-07T20:25:38.8852163Z 2025-05-07T20:25:38.9066132Z cuda-nsight-12.8.55 | 113.2 MB | #######6 | 76%  2025-05-07T20:25:38.9312920Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 76% 2025-05-07T20:25:38.9313199Z 2025-05-07T20:25:38.9313381Z 2025-05-07T20:25:38.9313392Z 2025-05-07T20:25:38.9313397Z 2025-05-07T20:25:38.9324816Z 2025-05-07T20:25:38.9411163Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 65%  2025-05-07T20:25:38.9411512Z 2025-05-07T20:25:38.9411518Z 2025-05-07T20:25:38.9411523Z 2025-05-07T20:25:38.9411532Z 2025-05-07T20:25:38.9411537Z 2025-05-07T20:25:38.9411542Z 2025-05-07T20:25:38.9411548Z 
2025-05-07T20:25:43.9229831Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:25:44.1308462Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%
2025-05-07T20:25:44.4317835Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:25:44.8461821Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:25:47.3180464Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:25:47.3585946Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:25:47.4587508Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:25:47.8875706Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:25:47.9851161Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:25:47.9900567Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:25:49.3524110Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:25:49.3754827Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:25:49.8019367Z python-3.9.18 | 22.7 MB | ########## | 100%
2025-05-07T20:25:49.8480656Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:25:50.2578389Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:25:50.3582268Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:25:50.4944547Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:25:52.3996710Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:25:53.3812579Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:25:55.9475069Z ... (more hidden) ...
2025-05-07T20:26:01.3038439Z 2025-05-07T20:26:01.3038452Z 2025-05-07T20:26:01.3038457Z 2025-05-07T20:26:01.3038462Z 2025-05-07T20:26:01.3038467Z 2025-05-07T20:26:01.3038472Z 2025-05-07T20:26:01.3038477Z 2025-05-07T20:26:01.3038748Z  2025-05-07T20:26:01.3039078Z 2025-05-07T20:26:01.3039083Z 2025-05-07T20:26:01.3039088Z 2025-05-07T20:26:01.3039093Z 2025-05-07T20:26:01.3039098Z 2025-05-07T20:26:01.3039103Z 2025-05-07T20:26:01.3039108Z 2025-05-07T20:26:01.3039113Z 2025-05-07T20:26:01.3039133Z 2025-05-07T20:26:01.3039138Z 2025-05-07T20:26:01.3039143Z 2025-05-07T20:26:01.3039148Z 2025-05-07T20:26:01.3039153Z 2025-05-07T20:26:01.3039158Z 2025-05-07T20:26:01.3039458Z  2025-05-07T20:26:01.3039781Z 2025-05-07T20:26:01.3039794Z 2025-05-07T20:26:01.3039905Z 2025-05-07T20:26:01.3039911Z 2025-05-07T20:26:01.3039916Z 2025-05-07T20:26:01.3039921Z 2025-05-07T20:26:01.3039926Z 2025-05-07T20:26:01.3039931Z 2025-05-07T20:26:01.3039936Z 2025-05-07T20:26:01.3039941Z 2025-05-07T20:26:01.3039946Z 2025-05-07T20:26:01.3039951Z 2025-05-07T20:26:01.3039964Z 2025-05-07T20:26:01.3039969Z 2025-05-07T20:26:01.3039974Z 2025-05-07T20:26:01.3040270Z  2025-05-07T20:26:01.3040604Z 2025-05-07T20:26:01.3040610Z 2025-05-07T20:26:01.3040616Z 2025-05-07T20:26:01.3040621Z 2025-05-07T20:26:01.3040636Z 2025-05-07T20:26:01.3040641Z 2025-05-07T20:26:01.3040646Z 2025-05-07T20:26:01.3040651Z 2025-05-07T20:26:01.3040655Z 2025-05-07T20:26:01.3040660Z 2025-05-07T20:26:01.3040665Z 2025-05-07T20:26:01.3040670Z 2025-05-07T20:26:01.3040675Z 2025-05-07T20:26:01.3040681Z 2025-05-07T20:26:01.3040685Z 2025-05-07T20:26:01.3040691Z 2025-05-07T20:26:01.3041116Z  2025-05-07T20:26:01.3041463Z 2025-05-07T20:26:01.3041468Z 2025-05-07T20:26:01.3041473Z 2025-05-07T20:26:01.3041478Z 2025-05-07T20:26:01.3041483Z 2025-05-07T20:26:01.3041488Z 2025-05-07T20:26:01.3041493Z 2025-05-07T20:26:01.3041498Z 2025-05-07T20:26:01.3041503Z 2025-05-07T20:26:01.3041508Z 2025-05-07T20:26:01.3041513Z 2025-05-07T20:26:01.3041518Z 2025-05-07T20:26:01.3041524Z 2025-05-07T20:26:01.3041529Z 2025-05-07T20:26:01.3041534Z 2025-05-07T20:26:01.3041539Z 2025-05-07T20:26:01.3041544Z 2025-05-07T20:26:01.3041860Z  2025-05-07T20:26:01.3042201Z 2025-05-07T20:26:01.3042206Z 2025-05-07T20:26:01.3042211Z 2025-05-07T20:26:01.3042216Z 2025-05-07T20:26:01.3042222Z 2025-05-07T20:26:01.3042234Z 2025-05-07T20:26:01.3042240Z 2025-05-07T20:26:01.3042244Z 2025-05-07T20:26:01.3042249Z 2025-05-07T20:26:01.3042268Z 2025-05-07T20:26:01.3042273Z 2025-05-07T20:26:01.3042278Z 2025-05-07T20:26:01.3042283Z 2025-05-07T20:26:01.3042288Z 2025-05-07T20:26:01.3042293Z 2025-05-07T20:26:01.3042299Z 2025-05-07T20:26:01.3042304Z 2025-05-07T20:26:01.3042308Z 2025-05-07T20:26:01.3042612Z  2025-05-07T20:26:01.3042958Z 2025-05-07T20:26:01.3042963Z 2025-05-07T20:26:01.3043103Z  2025-05-07T20:26:01.3043250Z 2025-05-07T20:26:01.3043255Z 2025-05-07T20:26:01.3043394Z  2025-05-07T20:26:01.3043541Z 2025-05-07T20:26:01.3043547Z 2025-05-07T20:26:01.3043552Z 2025-05-07T20:26:01.3043715Z  2025-05-07T20:26:01.3043863Z 2025-05-07T20:26:01.3043868Z 2025-05-07T20:26:01.3043873Z 2025-05-07T20:26:01.3043879Z 2025-05-07T20:26:01.3044029Z  2025-05-07T20:26:01.3044184Z 2025-05-07T20:26:01.3044189Z 2025-05-07T20:26:01.3044194Z 2025-05-07T20:26:01.3044199Z 2025-05-07T20:26:01.3044212Z 2025-05-07T20:26:01.3044371Z  2025-05-07T20:26:01.3044537Z 2025-05-07T20:26:01.3044542Z 2025-05-07T20:26:01.3044547Z 2025-05-07T20:26:01.3044552Z 2025-05-07T20:26:01.3044558Z 2025-05-07T20:26:01.3044563Z 2025-05-07T20:26:01.3044720Z  2025-05-07T20:26:01.3044893Z 
2025-05-07T20:26:01.3044898Z 2025-05-07T20:26:01.3044903Z 2025-05-07T20:26:01.3044908Z 2025-05-07T20:26:01.3044913Z 2025-05-07T20:26:01.3044918Z 2025-05-07T20:26:01.3044923Z 2025-05-07T20:26:01.3045093Z  2025-05-07T20:26:01.3045282Z 2025-05-07T20:26:01.3045287Z 2025-05-07T20:26:01.3045292Z 2025-05-07T20:26:01.3045297Z 2025-05-07T20:26:01.3045302Z 2025-05-07T20:26:01.3045308Z 2025-05-07T20:26:01.3045313Z 2025-05-07T20:26:01.3045318Z 2025-05-07T20:26:01.3045487Z  2025-05-07T20:26:01.3045692Z 2025-05-07T20:26:01.3045698Z 2025-05-07T20:26:01.3045703Z 2025-05-07T20:26:01.3045708Z 2025-05-07T20:26:01.3045713Z 2025-05-07T20:26:01.3045718Z 2025-05-07T20:26:01.3045729Z 2025-05-07T20:26:01.3045858Z 2025-05-07T20:26:01.3045864Z 2025-05-07T20:26:01.3046058Z  2025-05-07T20:26:01.3046278Z 2025-05-07T20:26:01.3046284Z 2025-05-07T20:26:01.3046289Z 2025-05-07T20:26:01.3046294Z 2025-05-07T20:26:01.3046299Z 2025-05-07T20:26:01.3046304Z 2025-05-07T20:26:01.3046309Z 2025-05-07T20:26:01.3046314Z 2025-05-07T20:26:01.3046319Z 2025-05-07T20:26:01.3046332Z 2025-05-07T20:26:01.3046515Z  2025-05-07T20:26:01.3046744Z 2025-05-07T20:26:01.3046749Z 2025-05-07T20:26:01.3046755Z 2025-05-07T20:26:01.3046759Z 2025-05-07T20:26:01.3046765Z 2025-05-07T20:26:01.3046770Z 2025-05-07T20:26:01.3046775Z 2025-05-07T20:26:01.3046787Z 2025-05-07T20:26:01.3046792Z 2025-05-07T20:26:01.3046797Z 2025-05-07T20:26:01.3046802Z 2025-05-07T20:26:01.3046990Z  2025-05-07T20:26:01.3047238Z 2025-05-07T20:26:01.3047250Z 2025-05-07T20:26:01.3047256Z 2025-05-07T20:26:01.3047261Z 2025-05-07T20:26:01.3047366Z 2025-05-07T20:26:01.3047377Z 2025-05-07T20:26:01.3047383Z 2025-05-07T20:26:01.3047388Z 2025-05-07T20:26:01.3047393Z 2025-05-07T20:26:01.3047398Z 2025-05-07T20:26:01.3047403Z 2025-05-07T20:26:01.3047408Z 2025-05-07T20:26:01.3047602Z  2025-05-07T20:26:01.3047865Z 2025-05-07T20:26:01.3047870Z 2025-05-07T20:26:01.3047875Z 2025-05-07T20:26:01.3047880Z 2025-05-07T20:26:01.3047885Z 2025-05-07T20:26:01.3047890Z 2025-05-07T20:26:01.3047895Z 2025-05-07T20:26:01.3047900Z 2025-05-07T20:26:01.3047905Z 2025-05-07T20:26:01.3047910Z 2025-05-07T20:26:01.3047915Z 2025-05-07T20:26:01.3047920Z 2025-05-07T20:26:01.3047925Z 2025-05-07T20:26:01.3048132Z  2025-05-07T20:26:01.3048393Z 2025-05-07T20:26:01.3048399Z 2025-05-07T20:26:01.3048404Z 2025-05-07T20:26:01.3048409Z 2025-05-07T20:26:01.3048414Z 2025-05-07T20:26:01.3048419Z 2025-05-07T20:26:01.3048424Z 2025-05-07T20:26:01.3048429Z 2025-05-07T20:26:01.3048435Z 2025-05-07T20:26:01.3048452Z 2025-05-07T20:26:01.3048458Z 2025-05-07T20:26:01.3048463Z 2025-05-07T20:26:01.3048478Z 2025-05-07T20:26:01.3048484Z 2025-05-07T20:26:01.3048679Z  2025-05-07T20:26:01.3048951Z 2025-05-07T20:26:01.3048956Z 2025-05-07T20:26:01.3048961Z 2025-05-07T20:26:01.3048966Z 2025-05-07T20:26:01.3048972Z 2025-05-07T20:26:01.3048986Z 2025-05-07T20:26:01.3048993Z 2025-05-07T20:26:01.3048999Z 2025-05-07T20:26:01.3049006Z 2025-05-07T20:26:01.3049012Z 2025-05-07T20:26:01.3049018Z 2025-05-07T20:26:01.3049025Z 2025-05-07T20:26:01.3049032Z 2025-05-07T20:26:01.3049038Z 2025-05-07T20:26:01.3049044Z 2025-05-07T20:26:01.3049294Z  2025-05-07T20:26:01.3049579Z 2025-05-07T20:26:01.3049584Z 2025-05-07T20:26:01.3049589Z 2025-05-07T20:26:01.3049595Z 2025-05-07T20:26:01.3049599Z 2025-05-07T20:26:01.3049605Z 2025-05-07T20:26:01.3049610Z 2025-05-07T20:26:01.3049615Z 2025-05-07T20:26:01.3049620Z 2025-05-07T20:26:01.3049631Z 2025-05-07T20:26:01.3049641Z 2025-05-07T20:26:01.3049646Z 2025-05-07T20:26:01.3049652Z 2025-05-07T20:26:01.3049657Z 2025-05-07T20:26:01.3049662Z 
2025-05-07T20:26:01.3049667Z 2025-05-07T20:26:01.3049882Z  2025-05-07T20:26:01.3050179Z 2025-05-07T20:26:01.3050185Z 2025-05-07T20:26:01.3050191Z 2025-05-07T20:26:01.3050196Z 2025-05-07T20:26:01.3050201Z 2025-05-07T20:26:01.3050206Z 2025-05-07T20:26:01.3050211Z 2025-05-07T20:26:01.3050216Z 2025-05-07T20:26:01.3050221Z 2025-05-07T20:26:01.3050237Z 2025-05-07T20:26:01.3050242Z 2025-05-07T20:26:01.3050247Z 2025-05-07T20:26:01.3050252Z 2025-05-07T20:26:01.3050257Z 2025-05-07T20:26:01.3050262Z 2025-05-07T20:26:01.3050267Z 2025-05-07T20:26:01.3050272Z 2025-05-07T20:26:01.3050512Z  2025-05-07T20:26:01.3050814Z 2025-05-07T20:26:01.3050819Z 2025-05-07T20:26:01.3050824Z 2025-05-07T20:26:01.3050830Z 2025-05-07T20:26:01.3050835Z 2025-05-07T20:26:01.3050961Z 2025-05-07T20:26:01.3050968Z 2025-05-07T20:26:01.3050973Z 2025-05-07T20:26:01.3050978Z 2025-05-07T20:26:01.3050983Z 2025-05-07T20:26:01.3050988Z 2025-05-07T20:26:01.3050993Z 2025-05-07T20:26:01.3050998Z 2025-05-07T20:26:01.3051003Z 2025-05-07T20:26:01.3051008Z 2025-05-07T20:26:01.3051013Z 2025-05-07T20:26:01.3051018Z 2025-05-07T20:26:01.3051023Z 2025-05-07T20:26:01.3051274Z  2025-05-07T20:26:01.3051580Z 2025-05-07T20:26:01.3051586Z 2025-05-07T20:26:01.3051727Z  2025-05-07T20:26:01.3051872Z 2025-05-07T20:26:01.3051877Z 2025-05-07T20:26:01.3052011Z  2025-05-07T20:26:01.3052158Z 2025-05-07T20:26:01.3052163Z 2025-05-07T20:26:01.3052168Z 2025-05-07T20:26:01.3052302Z  2025-05-07T20:26:01.3052451Z 2025-05-07T20:26:01.3052457Z 2025-05-07T20:26:01.3052469Z 2025-05-07T20:26:01.3052475Z 2025-05-07T20:26:01.3052621Z  2025-05-07T20:26:01.3052780Z 2025-05-07T20:26:01.3052785Z 2025-05-07T20:26:01.3052922Z 2025-05-07T20:26:01.3052934Z 2025-05-07T20:26:01.3052939Z 2025-05-07T20:26:01.3053106Z  2025-05-07T20:26:01.3053272Z 2025-05-07T20:26:01.3053277Z 2025-05-07T20:26:01.3053282Z 2025-05-07T20:26:01.3053287Z 2025-05-07T20:26:01.3053292Z 2025-05-07T20:26:01.3053297Z 2025-05-07T20:26:01.3053456Z  2025-05-07T20:26:01.3053629Z 2025-05-07T20:26:01.3053634Z 2025-05-07T20:26:01.3053639Z 2025-05-07T20:26:01.3053644Z 2025-05-07T20:26:01.3053649Z 2025-05-07T20:26:01.3053654Z 2025-05-07T20:26:01.3053659Z 2025-05-07T20:26:01.3053831Z  2025-05-07T20:26:01.3054017Z 2025-05-07T20:26:01.3054022Z 2025-05-07T20:26:01.3054027Z 2025-05-07T20:26:01.3054032Z 2025-05-07T20:26:01.3054037Z 2025-05-07T20:26:01.3054042Z 2025-05-07T20:26:01.3054047Z 2025-05-07T20:26:01.3054053Z 2025-05-07T20:26:01.3054226Z  2025-05-07T20:26:01.3054433Z 2025-05-07T20:26:01.3054439Z 2025-05-07T20:26:01.3054444Z 2025-05-07T20:26:01.3054459Z 2025-05-07T20:26:01.3054471Z 2025-05-07T20:26:01.3054476Z 2025-05-07T20:26:01.3054490Z 2025-05-07T20:26:01.3054495Z 2025-05-07T20:26:01.3054500Z 2025-05-07T20:26:01.3054679Z  2025-05-07T20:26:01.3054896Z 2025-05-07T20:26:01.3054901Z 2025-05-07T20:26:01.3054906Z 2025-05-07T20:26:01.3054918Z 2025-05-07T20:26:01.3054923Z 2025-05-07T20:26:01.3054928Z 2025-05-07T20:26:01.3054933Z 2025-05-07T20:26:01.3054938Z 2025-05-07T20:26:01.3054943Z 2025-05-07T20:26:01.3054948Z 2025-05-07T20:26:01.3055128Z  2025-05-07T20:26:01.3055357Z 2025-05-07T20:26:01.3055362Z 2025-05-07T20:26:01.3055367Z 2025-05-07T20:26:01.3055372Z 2025-05-07T20:26:01.3055377Z 2025-05-07T20:26:01.3055387Z 2025-05-07T20:26:01.3055392Z 2025-05-07T20:26:01.3055396Z 2025-05-07T20:26:01.3055402Z 2025-05-07T20:26:01.3055407Z 2025-05-07T20:26:01.3055412Z 2025-05-07T20:26:01.3055795Z  2025-05-07T20:26:01.3056057Z 2025-05-07T20:26:01.3056063Z 2025-05-07T20:26:01.3056087Z 2025-05-07T20:26:01.3056092Z 2025-05-07T20:26:01.3056098Z 
2025-05-07T20:26:01.3056103Z 2025-05-07T20:26:01.3056118Z 2025-05-07T20:26:01.3056124Z 2025-05-07T20:26:01.3056129Z 2025-05-07T20:26:01.3056143Z 2025-05-07T20:26:01.3056148Z 2025-05-07T20:26:01.3056153Z 2025-05-07T20:26:01.3056340Z  2025-05-07T20:26:01.3056612Z 2025-05-07T20:26:01.3056618Z 2025-05-07T20:26:01.3056623Z 2025-05-07T20:26:01.3056628Z 2025-05-07T20:26:01.3056633Z 2025-05-07T20:26:01.3056639Z 2025-05-07T20:26:01.3056644Z 2025-05-07T20:26:01.3056649Z 2025-05-07T20:26:01.3056654Z 2025-05-07T20:26:01.3056659Z 2025-05-07T20:26:01.3056664Z 2025-05-07T20:26:01.3056669Z 2025-05-07T20:26:01.3056674Z 2025-05-07T20:26:01.3056873Z  2025-05-07T20:26:01.3057146Z 2025-05-07T20:26:01.3057151Z 2025-05-07T20:26:01.3057156Z 2025-05-07T20:26:01.3057161Z 2025-05-07T20:26:01.3057167Z 2025-05-07T20:26:01.3057172Z 2025-05-07T20:26:01.3057184Z 2025-05-07T20:26:01.3057321Z 2025-05-07T20:26:01.3057329Z 2025-05-07T20:26:01.3057334Z 2025-05-07T20:26:01.3057339Z 2025-05-07T20:26:01.3057344Z 2025-05-07T20:26:01.3057349Z 2025-05-07T20:26:01.3057354Z 2025-05-07T20:26:01.3057578Z  2025-05-07T20:26:01.3057852Z 2025-05-07T20:26:01.3057858Z 2025-05-07T20:26:01.3057863Z 2025-05-07T20:26:01.3057868Z 2025-05-07T20:26:01.3057873Z 2025-05-07T20:26:01.3057878Z 2025-05-07T20:26:01.3057883Z 2025-05-07T20:26:01.3057897Z 2025-05-07T20:26:01.3057902Z 2025-05-07T20:26:01.3057907Z 2025-05-07T20:26:01.3057912Z 2025-05-07T20:26:01.3057917Z 2025-05-07T20:26:01.3057922Z 2025-05-07T20:26:01.3057926Z 2025-05-07T20:26:01.3057931Z 2025-05-07T20:26:01.3058137Z  2025-05-07T20:26:01.3058419Z 2025-05-07T20:26:01.3058425Z 2025-05-07T20:26:01.3058430Z 2025-05-07T20:26:01.3058435Z 2025-05-07T20:26:01.3058440Z 2025-05-07T20:26:01.3058458Z 2025-05-07T20:26:01.3058559Z 2025-05-07T20:26:01.3058571Z 2025-05-07T20:26:01.3058577Z 2025-05-07T20:26:01.3058582Z 2025-05-07T20:26:01.3058587Z 2025-05-07T20:26:01.3058592Z 2025-05-07T20:26:01.3058597Z 2025-05-07T20:26:01.3058602Z 2025-05-07T20:26:01.3058607Z 2025-05-07T20:26:01.3058612Z 2025-05-07T20:26:01.3058836Z  2025-05-07T20:26:01.3059125Z 2025-05-07T20:26:01.3059130Z 2025-05-07T20:26:01.3059135Z 2025-05-07T20:26:01.3059141Z 2025-05-07T20:26:01.3059146Z 2025-05-07T20:26:01.3059151Z 2025-05-07T20:26:01.3059157Z 2025-05-07T20:26:01.3059162Z 2025-05-07T20:26:01.3059167Z 2025-05-07T20:26:01.3059173Z 2025-05-07T20:26:01.3059186Z 2025-05-07T20:26:01.3059191Z 2025-05-07T20:26:01.3059197Z 2025-05-07T20:26:01.3059202Z 2025-05-07T20:26:01.3059207Z 2025-05-07T20:26:01.3059212Z 2025-05-07T20:26:01.3059217Z 2025-05-07T20:26:01.3059442Z  2025-05-07T20:26:01.3059747Z 2025-05-07T20:26:01.3059752Z 2025-05-07T20:26:01.3059764Z 2025-05-07T20:26:01.3059776Z 2025-05-07T20:26:01.3059781Z 2025-05-07T20:26:01.3059786Z 2025-05-07T20:26:01.3059791Z 2025-05-07T20:26:01.3059796Z 2025-05-07T20:26:01.3059801Z 2025-05-07T20:26:01.3059806Z 2025-05-07T20:26:01.3059811Z 2025-05-07T20:26:01.3059816Z 2025-05-07T20:26:01.3059821Z 2025-05-07T20:26:01.3059826Z 2025-05-07T20:26:01.3059831Z 2025-05-07T20:26:01.3059836Z 2025-05-07T20:26:01.3059841Z 2025-05-07T20:26:01.3059846Z 2025-05-07T20:26:01.3060087Z  2025-05-07T20:26:01.3060384Z 2025-05-07T20:26:01.3060389Z 2025-05-07T20:26:01.3060522Z  2025-05-07T20:26:01.3060666Z 2025-05-07T20:26:01.3060672Z 2025-05-07T20:26:01.3060812Z  2025-05-07T20:26:01.3060962Z 2025-05-07T20:26:01.3060968Z 2025-05-07T20:26:01.3060973Z 2025-05-07T20:26:01.3061114Z  2025-05-07T20:26:01.3061254Z 2025-05-07T20:26:01.3061260Z 2025-05-07T20:26:01.3061265Z 2025-05-07T20:26:01.3061277Z 2025-05-07T20:26:01.3061424Z  
2025-05-07T20:26:01.3061590Z 2025-05-07T20:26:01.3061595Z 2025-05-07T20:26:01.3061601Z 2025-05-07T20:26:01.3061606Z 2025-05-07T20:26:01.3061611Z 2025-05-07T20:26:01.3061764Z  2025-05-07T20:26:01.3061927Z 2025-05-07T20:26:01.3061932Z 2025-05-07T20:26:01.3061938Z 2025-05-07T20:26:01.3061943Z 2025-05-07T20:26:01.3061948Z 2025-05-07T20:26:01.3061959Z 2025-05-07T20:26:01.3062117Z  2025-05-07T20:26:01.3062287Z 2025-05-07T20:26:01.3062292Z 2025-05-07T20:26:01.3062297Z 2025-05-07T20:26:01.3062302Z 2025-05-07T20:26:01.3062307Z 2025-05-07T20:26:01.3062312Z 2025-05-07T20:26:01.3062317Z 2025-05-07T20:26:01.3062481Z  2025-05-07T20:26:01.3062665Z 2025-05-07T20:26:01.3062670Z 2025-05-07T20:26:01.3062675Z 2025-05-07T20:26:01.3062680Z 2025-05-07T20:26:01.3062685Z 2025-05-07T20:26:01.3062690Z 2025-05-07T20:26:01.3062695Z 2025-05-07T20:26:01.3062700Z 2025-05-07T20:26:01.3062871Z  2025-05-07T20:26:01.3063092Z 2025-05-07T20:26:01.3063103Z 2025-05-07T20:26:01.3063213Z 2025-05-07T20:26:01.3063219Z 2025-05-07T20:26:01.3063224Z 2025-05-07T20:26:01.3063229Z 2025-05-07T20:26:01.3063242Z 2025-05-07T20:26:01.3063247Z 2025-05-07T20:26:01.3063252Z 2025-05-07T20:26:01.3063438Z  2025-05-07T20:26:01.3063652Z 2025-05-07T20:26:01.3063657Z 2025-05-07T20:26:01.3063662Z 2025-05-07T20:26:01.3063675Z 2025-05-07T20:26:01.3063680Z 2025-05-07T20:26:01.3063686Z 2025-05-07T20:26:01.3063691Z 2025-05-07T20:26:01.3063696Z 2025-05-07T20:26:01.3063700Z 2025-05-07T20:26:01.3063705Z 2025-05-07T20:26:01.3063890Z  2025-05-07T20:26:01.3064125Z 2025-05-07T20:26:01.3064130Z 2025-05-07T20:26:01.3064135Z 2025-05-07T20:26:01.3064140Z 2025-05-07T20:26:01.3064145Z 2025-05-07T20:26:01.3064150Z 2025-05-07T20:26:01.3064155Z 2025-05-07T20:26:01.3064160Z 2025-05-07T20:26:01.3064165Z 2025-05-07T20:26:01.3064170Z 2025-05-07T20:26:01.3064175Z 2025-05-07T20:26:01.3064354Z  2025-05-07T20:26:01.3064709Z 2025-05-07T20:26:01.3064714Z 2025-05-07T20:26:01.3064719Z 2025-05-07T20:26:01.3064724Z 2025-05-07T20:26:01.3064730Z 2025-05-07T20:26:01.3064735Z 2025-05-07T20:26:01.3064740Z 2025-05-07T20:26:01.3064745Z 2025-05-07T20:26:01.3064750Z 2025-05-07T20:26:01.3064755Z 2025-05-07T20:26:01.3064760Z 2025-05-07T20:26:01.3064765Z 2025-05-07T20:26:01.3064966Z  2025-05-07T20:26:01.3065222Z 2025-05-07T20:26:01.3065228Z 2025-05-07T20:26:01.3065233Z 2025-05-07T20:26:01.3065238Z 2025-05-07T20:26:01.3065243Z 2025-05-07T20:26:01.3065248Z 2025-05-07T20:26:01.3065254Z 2025-05-07T20:26:01.3065259Z 2025-05-07T20:26:01.3065264Z 2025-05-07T20:26:01.3065270Z 2025-05-07T20:26:01.3065275Z 2025-05-07T20:26:01.3065287Z 2025-05-07T20:26:01.3065292Z 2025-05-07T20:26:01.3065487Z  2025-05-07T20:26:01.3065753Z 2025-05-07T20:26:01.3065758Z 2025-05-07T20:26:01.3065763Z 2025-05-07T20:26:01.3065768Z 2025-05-07T20:26:01.3065780Z 2025-05-07T20:26:01.3065791Z 2025-05-07T20:26:01.3065804Z 2025-05-07T20:26:01.3065810Z 2025-05-07T20:26:01.3065814Z 2025-05-07T20:26:01.3065820Z 2025-05-07T20:26:01.3065825Z 2025-05-07T20:26:01.3065830Z 2025-05-07T20:26:01.3065835Z 2025-05-07T20:26:01.3065840Z 2025-05-07T20:26:01.3066043Z  2025-05-07T20:26:01.3066319Z 2025-05-07T20:26:01.3066325Z 2025-05-07T20:26:01.3066330Z 2025-05-07T20:26:01.3066335Z 2025-05-07T20:26:01.3066340Z 2025-05-07T20:26:01.3066345Z 2025-05-07T20:26:01.3066350Z 2025-05-07T20:26:01.3066355Z 2025-05-07T20:26:01.3066360Z 2025-05-07T20:26:01.3066365Z 2025-05-07T20:26:01.3066370Z 2025-05-07T20:26:01.3066375Z 2025-05-07T20:26:01.3066380Z 2025-05-07T20:26:01.3066385Z 2025-05-07T20:26:01.3066390Z 2025-05-07T20:26:01.3066599Z  2025-05-07T20:26:01.3066876Z 
2025-05-07T20:26:01.3066881Z 2025-05-07T20:26:01.3066886Z 2025-05-07T20:26:01.3066891Z 2025-05-07T20:26:01.3066902Z 2025-05-07T20:26:01.3066913Z 2025-05-07T20:26:01.3066918Z 2025-05-07T20:26:01.3066923Z 2025-05-07T20:26:01.3066928Z 2025-05-07T20:26:01.3066933Z 2025-05-07T20:26:01.3066938Z 2025-05-07T20:26:01.3066950Z 2025-05-07T20:26:01.3066955Z 2025-05-07T20:26:01.3066960Z 2025-05-07T20:26:01.3066965Z 2025-05-07T20:26:01.3066970Z 2025-05-07T20:26:01.3067185Z  2025-05-07T20:26:01.3067488Z 2025-05-07T20:26:01.3067494Z 2025-05-07T20:26:01.3067499Z 2025-05-07T20:26:01.3067505Z 2025-05-07T20:26:01.3067510Z 2025-05-07T20:26:01.3067515Z 2025-05-07T20:26:01.3067520Z 2025-05-07T20:26:01.3067525Z 2025-05-07T20:26:01.3067530Z 2025-05-07T20:26:01.3067535Z 2025-05-07T20:26:01.3067540Z 2025-05-07T20:26:01.3067545Z 2025-05-07T20:26:01.3067550Z 2025-05-07T20:26:01.3067555Z 2025-05-07T20:26:01.3067560Z 2025-05-07T20:26:01.3067565Z 2025-05-07T20:26:01.3067571Z 2025-05-07T20:26:01.3067819Z  2025-05-07T20:26:01.3068124Z 2025-05-07T20:26:01.3068238Z 2025-05-07T20:26:01.3068244Z 2025-05-07T20:26:01.3068249Z 2025-05-07T20:26:01.3068254Z 2025-05-07T20:26:01.3068259Z 2025-05-07T20:26:01.3068265Z 2025-05-07T20:26:01.3068270Z 2025-05-07T20:26:01.3068275Z 2025-05-07T20:26:01.3068280Z 2025-05-07T20:26:01.3068285Z 2025-05-07T20:26:01.3068290Z 2025-05-07T20:26:01.3068304Z 2025-05-07T20:26:01.3068309Z 2025-05-07T20:26:01.3068314Z 2025-05-07T20:26:01.3068319Z 2025-05-07T20:26:01.3068324Z 2025-05-07T20:26:01.3068329Z 2025-05-07T20:26:01.3068583Z  2025-05-07T20:26:01.3068889Z 2025-05-07T20:26:01.3068895Z 2025-05-07T20:26:01.3069030Z  2025-05-07T20:26:01.3069167Z 2025-05-07T20:26:01.3069172Z 2025-05-07T20:26:01.3069317Z  2025-05-07T20:26:01.3069465Z 2025-05-07T20:26:01.3069470Z 2025-05-07T20:26:01.3069475Z 2025-05-07T20:26:01.3069635Z  2025-05-07T20:26:01.3069994Z 2025-05-07T20:26:01.3069999Z 2025-05-07T20:26:01.3070004Z 2025-05-07T20:26:01.3070116Z 2025-05-07T20:26:01.3070276Z  2025-05-07T20:26:01.3070439Z 2025-05-07T20:26:01.3070444Z 2025-05-07T20:26:01.3070449Z 2025-05-07T20:26:01.3070455Z 2025-05-07T20:26:01.3070460Z 2025-05-07T20:26:01.3070614Z  2025-05-07T20:26:01.3070788Z 2025-05-07T20:26:01.3070793Z 2025-05-07T20:26:01.3070799Z 2025-05-07T20:26:01.3070804Z 2025-05-07T20:26:01.3070809Z 2025-05-07T20:26:01.3070814Z 2025-05-07T20:26:01.3070974Z  2025-05-07T20:26:01.3071151Z 2025-05-07T20:26:01.3071156Z 2025-05-07T20:26:01.3071161Z 2025-05-07T20:26:01.3071167Z 2025-05-07T20:26:01.3071171Z 2025-05-07T20:26:01.3071177Z 2025-05-07T20:26:01.3071182Z 2025-05-07T20:26:01.3071334Z  2025-05-07T20:26:01.3071535Z 2025-05-07T20:26:01.3071540Z 2025-05-07T20:26:01.3071545Z 2025-05-07T20:26:01.3071551Z 2025-05-07T20:26:01.3071556Z 2025-05-07T20:26:01.3071561Z 2025-05-07T20:26:01.3071566Z 2025-05-07T20:26:01.3071571Z 2025-05-07T20:26:01.3071742Z  2025-05-07T20:26:01.3071966Z 2025-05-07T20:26:01.3071971Z 2025-05-07T20:26:01.3071976Z 2025-05-07T20:26:01.3071981Z 2025-05-07T20:26:01.3071986Z 2025-05-07T20:26:01.3071991Z 2025-05-07T20:26:01.3071995Z 2025-05-07T20:26:01.3072001Z 2025-05-07T20:26:01.3072005Z 2025-05-07T20:26:01.3072171Z  2025-05-07T20:26:01.3072389Z 2025-05-07T20:26:01.3072394Z 2025-05-07T20:26:01.3072399Z 2025-05-07T20:26:01.3072404Z 2025-05-07T20:26:01.3072409Z 2025-05-07T20:26:01.3072414Z 2025-05-07T20:26:01.3072420Z 2025-05-07T20:26:01.3072425Z 2025-05-07T20:26:01.3072430Z 2025-05-07T20:26:01.3072435Z 2025-05-07T20:26:01.3072633Z  2025-05-07T20:26:01.3072860Z 2025-05-07T20:26:01.3072865Z 2025-05-07T20:26:01.3072870Z 
2025-05-07T20:26:01.3072875Z 2025-05-07T20:26:01.3072879Z 2025-05-07T20:26:01.3072885Z 2025-05-07T20:26:01.3072890Z 2025-05-07T20:26:01.3072895Z 2025-05-07T20:26:01.3072900Z 2025-05-07T20:26:01.3072905Z 2025-05-07T20:26:01.3072916Z 2025-05-07T20:26:01.3073108Z  2025-05-07T20:26:01.3073354Z 2025-05-07T20:26:01.3073360Z 2025-05-07T20:26:01.3073365Z 2025-05-07T20:26:01.3073370Z 2025-05-07T20:26:01.3073376Z 2025-05-07T20:26:01.3073381Z 2025-05-07T20:26:01.3073386Z 2025-05-07T20:26:01.3073391Z 2025-05-07T20:26:01.3073404Z 2025-05-07T20:26:01.3073409Z 2025-05-07T20:26:01.3073414Z 2025-05-07T20:26:01.3073428Z 2025-05-07T20:26:01.3073612Z  2025-05-07T20:26:01.3073865Z 2025-05-07T20:26:01.3073878Z 2025-05-07T20:26:01.3073884Z 2025-05-07T20:26:01.3073889Z 2025-05-07T20:26:01.3073894Z 2025-05-07T20:26:01.3073899Z 2025-05-07T20:26:01.3073904Z 2025-05-07T20:26:01.3073909Z 2025-05-07T20:26:01.3073914Z 2025-05-07T20:26:01.3073918Z 2025-05-07T20:26:01.3073923Z 2025-05-07T20:26:01.3073928Z 2025-05-07T20:26:01.3073933Z 2025-05-07T20:26:01.3074116Z  2025-05-07T20:26:01.3074389Z 2025-05-07T20:26:01.3074394Z 2025-05-07T20:26:01.3074526Z 2025-05-07T20:26:01.3074532Z 2025-05-07T20:26:01.3074538Z 2025-05-07T20:26:01.3074543Z 2025-05-07T20:26:01.3074548Z 2025-05-07T20:26:01.3074553Z 2025-05-07T20:26:01.3074558Z 2025-05-07T20:26:01.3074563Z 2025-05-07T20:26:01.3074568Z 2025-05-07T20:26:01.3074573Z 2025-05-07T20:26:01.3074578Z 2025-05-07T20:26:01.3074583Z 2025-05-07T20:26:01.3074812Z  2025-05-07T20:26:01.3075084Z 2025-05-07T20:26:01.3075090Z 2025-05-07T20:26:01.3075095Z 2025-05-07T20:26:01.3075100Z 2025-05-07T20:26:01.3075105Z 2025-05-07T20:26:01.3075110Z 2025-05-07T20:26:01.3075115Z 2025-05-07T20:26:01.3075120Z 2025-05-07T20:26:01.3075126Z 2025-05-07T20:26:01.3075139Z 2025-05-07T20:26:01.3075145Z 2025-05-07T20:26:01.3075150Z 2025-05-07T20:26:01.3075155Z 2025-05-07T20:26:01.3075160Z 2025-05-07T20:26:01.3075165Z 2025-05-07T20:26:01.3075364Z  2025-05-07T20:26:01.3075641Z 2025-05-07T20:26:01.3075654Z 2025-05-07T20:26:01.3075783Z 2025-05-07T20:26:01.3075794Z 2025-05-07T20:26:01.3075799Z 2025-05-07T20:26:01.3075804Z 2025-05-07T20:26:01.3075809Z 2025-05-07T20:26:01.3075814Z 2025-05-07T20:26:01.3075819Z 2025-05-07T20:26:01.3075824Z 2025-05-07T20:26:01.3075829Z 2025-05-07T20:26:01.3075834Z 2025-05-07T20:26:01.3075839Z 2025-05-07T20:26:01.3075844Z 2025-05-07T20:26:01.3075849Z 2025-05-07T20:26:01.3075854Z 2025-05-07T20:26:01.3076086Z  2025-05-07T20:26:01.3076377Z 2025-05-07T20:26:01.3076383Z 2025-05-07T20:26:01.3076388Z 2025-05-07T20:26:01.3076393Z 2025-05-07T20:26:01.3076398Z 2025-05-07T20:26:01.3076404Z 2025-05-07T20:26:01.3076409Z 2025-05-07T20:26:01.3076415Z 2025-05-07T20:26:01.3076420Z 2025-05-07T20:26:01.3076425Z 2025-05-07T20:26:01.3076430Z 2025-05-07T20:26:01.3076435Z 2025-05-07T20:26:01.3076447Z 2025-05-07T20:26:01.3076452Z 2025-05-07T20:26:01.3076457Z 2025-05-07T20:26:01.3076462Z 2025-05-07T20:26:01.3076467Z 2025-05-07T20:26:01.3076703Z  2025-05-07T20:26:01.3076991Z 2025-05-07T20:26:01.3076996Z 2025-05-07T20:26:01.3077009Z 2025-05-07T20:26:01.3077015Z 2025-05-07T20:26:01.3077020Z 2025-05-07T20:26:01.3077025Z 2025-05-07T20:26:01.3077030Z 2025-05-07T20:26:01.3077035Z 2025-05-07T20:26:01.3077040Z 2025-05-07T20:26:01.3077045Z 2025-05-07T20:26:01.3077050Z 2025-05-07T20:26:01.3077055Z 2025-05-07T20:26:01.3077060Z 2025-05-07T20:26:01.3077065Z 2025-05-07T20:26:01.3077070Z 2025-05-07T20:26:01.3077075Z 2025-05-07T20:26:01.3077080Z 2025-05-07T20:26:01.3077085Z 2025-05-07T20:26:01.3077325Z  2025-05-07T20:26:01.3077620Z 
2025-05-07T20:26:01.3077625Z 2025-05-07T20:26:01.3077755Z  2025-05-07T20:26:01.3077897Z 2025-05-07T20:26:01.3077903Z 2025-05-07T20:26:01.3078034Z  2025-05-07T20:26:01.3078173Z 2025-05-07T20:26:01.3078178Z 2025-05-07T20:26:01.3078194Z 2025-05-07T20:26:01.3078339Z  2025-05-07T20:26:01.3078488Z 2025-05-07T20:26:01.3078500Z 2025-05-07T20:26:01.3078505Z 2025-05-07T20:26:01.3078509Z 2025-05-07T20:26:01.3078665Z  2025-05-07T20:26:01.3078824Z 2025-05-07T20:26:01.3078830Z 2025-05-07T20:26:01.3078835Z 2025-05-07T20:26:01.3078840Z 2025-05-07T20:26:01.3078845Z 2025-05-07T20:26:01.3079002Z  2025-05-07T20:26:01.3079172Z 2025-05-07T20:26:01.3079178Z 2025-05-07T20:26:01.3079183Z 2025-05-07T20:26:01.3079188Z 2025-05-07T20:26:01.3079193Z 2025-05-07T20:26:01.3079198Z 2025-05-07T20:26:01.3079369Z  2025-05-07T20:26:01.3079540Z 2025-05-07T20:26:01.3079546Z 2025-05-07T20:26:01.3079551Z 2025-05-07T20:26:01.3079556Z 2025-05-07T20:26:01.3079561Z 2025-05-07T20:26:01.3079566Z 2025-05-07T20:26:01.3079571Z 2025-05-07T20:26:01.3079732Z  2025-05-07T20:26:01.3079924Z 2025-05-07T20:26:01.3079929Z 2025-05-07T20:26:01.3079934Z 2025-05-07T20:26:01.3079939Z 2025-05-07T20:26:01.3079944Z 2025-05-07T20:26:01.3079949Z 2025-05-07T20:26:01.3079959Z 2025-05-07T20:26:01.3080075Z 2025-05-07T20:26:01.3080263Z  2025-05-07T20:26:01.3080464Z 2025-05-07T20:26:01.3080469Z 2025-05-07T20:26:01.3080474Z 2025-05-07T20:26:01.3080479Z 2025-05-07T20:26:01.3080484Z 2025-05-07T20:26:01.3080489Z 2025-05-07T20:26:01.3080494Z 2025-05-07T20:26:01.3080499Z 2025-05-07T20:26:01.3080513Z 2025-05-07T20:26:01.3080676Z  2025-05-07T20:26:01.3080886Z 2025-05-07T20:26:01.3080892Z 2025-05-07T20:26:01.3080897Z 2025-05-07T20:26:01.3080902Z 2025-05-07T20:26:01.3080907Z 2025-05-07T20:26:01.3080912Z 2025-05-07T20:26:01.3080924Z 2025-05-07T20:26:01.3080929Z 2025-05-07T20:26:01.3080934Z 2025-05-07T20:26:01.3080939Z 2025-05-07T20:26:01.3081123Z  2025-05-07T20:26:01.3081344Z 2025-05-07T20:26:01.3081349Z 2025-05-07T20:26:01.3081364Z 2025-05-07T20:26:01.3081369Z 2025-05-07T20:26:01.3081374Z 2025-05-07T20:26:01.3081379Z 2025-05-07T20:26:01.3081384Z 2025-05-07T20:26:01.3081389Z 2025-05-07T20:26:01.3081489Z 2025-05-07T20:26:01.3081504Z 2025-05-07T20:26:01.3081509Z 2025-05-07T20:26:01.3081689Z  2025-05-07T20:26:01.3081936Z 2025-05-07T20:26:01.3081942Z 2025-05-07T20:26:01.3081947Z 2025-05-07T20:26:01.3081952Z 2025-05-07T20:26:01.3081957Z 2025-05-07T20:26:01.3081963Z 2025-05-07T20:26:01.3081968Z 2025-05-07T20:26:01.3081973Z 2025-05-07T20:26:01.3081978Z 2025-05-07T20:26:01.3081984Z 2025-05-07T20:26:01.3081989Z 2025-05-07T20:26:01.3081994Z 2025-05-07T20:26:01.3082197Z  2025-05-07T20:26:01.3082619Z 2025-05-07T20:26:01.3082625Z 2025-05-07T20:26:01.3082630Z 2025-05-07T20:26:01.3082635Z 2025-05-07T20:26:01.3082640Z 2025-05-07T20:26:01.3082646Z 2025-05-07T20:26:01.3082650Z 2025-05-07T20:26:01.3082655Z 2025-05-07T20:26:01.3082661Z 2025-05-07T20:26:01.3082666Z 2025-05-07T20:26:01.3082671Z 2025-05-07T20:26:01.3082676Z 2025-05-07T20:26:01.3082681Z 2025-05-07T20:26:01.3083103Z  2025-05-07T20:26:01.3083394Z 2025-05-07T20:26:01.3083399Z 2025-05-07T20:26:01.3083404Z 2025-05-07T20:26:01.3083409Z 2025-05-07T20:26:01.3083414Z 2025-05-07T20:26:01.3083420Z 2025-05-07T20:26:01.3083425Z 2025-05-07T20:26:01.3083430Z 2025-05-07T20:26:01.3083444Z 2025-05-07T20:26:01.3083449Z 2025-05-07T20:26:01.3083454Z 2025-05-07T20:26:01.3083459Z 2025-05-07T20:26:01.3083464Z 2025-05-07T20:26:01.3083469Z 2025-05-07T20:26:01.3083675Z  2025-05-07T20:26:01.3083945Z 2025-05-07T20:26:01.3083962Z 2025-05-07T20:26:01.3083968Z 
2025-05-07T20:26:01.3083973Z 2025-05-07T20:26:01.3083978Z 2025-05-07T20:26:01.3083983Z 2025-05-07T20:26:01.3083988Z 2025-05-07T20:26:01.3083993Z 2025-05-07T20:26:01.3083998Z 2025-05-07T20:26:01.3084003Z 2025-05-07T20:26:01.3084008Z 2025-05-07T20:26:01.3084013Z 2025-05-07T20:26:01.3084018Z 2025-05-07T20:26:01.3084023Z 2025-05-07T20:26:01.3084028Z 2025-05-07T20:26:01.3084248Z  2025-05-07T20:26:01.3084537Z 2025-05-07T20:26:01.3084543Z 2025-05-07T20:26:01.3084548Z 2025-05-07T20:26:01.3084553Z 2025-05-07T20:26:01.3084558Z 2025-05-07T20:26:01.3084563Z 2025-05-07T20:26:01.3084568Z 2025-05-07T20:26:01.3084573Z 2025-05-07T20:26:01.3084578Z 2025-05-07T20:26:01.3084583Z 2025-05-07T20:26:01.3084588Z 2025-05-07T20:26:01.3084593Z 2025-05-07T20:26:01.3084723Z 2025-05-07T20:26:01.3084736Z 2025-05-07T20:26:01.3084741Z 2025-05-07T20:26:01.3084747Z 2025-05-07T20:26:01.3084954Z  2025-05-07T20:26:01.3085239Z 2025-05-07T20:26:01.3085244Z 2025-05-07T20:26:01.3085249Z 2025-05-07T20:26:01.3085255Z 2025-05-07T20:26:01.3085266Z 2025-05-07T20:26:01.3085272Z 2025-05-07T20:26:01.3085277Z 2025-05-07T20:26:01.3085282Z 2025-05-07T20:26:01.3085287Z 2025-05-07T20:26:01.3085292Z 2025-05-07T20:26:01.3085297Z 2025-05-07T20:26:01.3085302Z 2025-05-07T20:26:01.3085307Z 2025-05-07T20:26:01.3085312Z 2025-05-07T20:26:01.3085317Z 2025-05-07T20:26:01.3085328Z 2025-05-07T20:26:01.3085501Z 2025-05-07T20:26:01.3085723Z  2025-05-07T20:26:01.3086022Z 2025-05-07T20:26:01.3086027Z 2025-05-07T20:26:01.3086032Z 2025-05-07T20:26:01.3086037Z 2025-05-07T20:26:01.3086042Z 2025-05-07T20:26:01.3086047Z 2025-05-07T20:26:01.3086052Z 2025-05-07T20:26:01.3086057Z 2025-05-07T20:26:01.3086062Z 2025-05-07T20:26:01.3086067Z 2025-05-07T20:26:01.3086072Z 2025-05-07T20:26:01.3086077Z 2025-05-07T20:26:01.3086082Z 2025-05-07T20:26:01.3086087Z 2025-05-07T20:26:01.3086092Z 2025-05-07T20:26:01.3086097Z 2025-05-07T20:26:01.3086102Z 2025-05-07T20:26:01.3086107Z 2025-05-07T20:26:01.3086344Z  2025-05-07T20:26:01.3086643Z 2025-05-07T20:26:01.3086649Z 2025-05-07T20:26:01.3086792Z  2025-05-07T20:26:01.3086929Z 2025-05-07T20:26:01.3086935Z 2025-05-07T20:26:01.3087069Z  2025-05-07T20:26:01.3087217Z 2025-05-07T20:26:01.3087222Z 2025-05-07T20:26:01.3087363Z 2025-05-07T20:26:01.3087523Z  2025-05-07T20:26:01.3087679Z 2025-05-07T20:26:01.3087684Z 2025-05-07T20:26:01.3087689Z 2025-05-07T20:26:01.3087693Z 2025-05-07T20:26:01.3087835Z  2025-05-07T20:26:01.3087989Z 2025-05-07T20:26:01.3087994Z 2025-05-07T20:26:01.3088007Z 2025-05-07T20:26:01.3088013Z 2025-05-07T20:26:01.3088018Z 2025-05-07T20:26:01.3088176Z  2025-05-07T20:26:01.3088343Z 2025-05-07T20:26:01.3088355Z 2025-05-07T20:26:01.3088361Z 2025-05-07T20:26:01.3088366Z 2025-05-07T20:26:01.3088371Z 2025-05-07T20:26:01.3088376Z 2025-05-07T20:26:01.3088532Z  2025-05-07T20:26:01.3088707Z 2025-05-07T20:26:01.3088712Z 2025-05-07T20:26:01.3088726Z 2025-05-07T20:26:01.3088731Z 2025-05-07T20:26:01.3088736Z 2025-05-07T20:26:01.3088742Z 2025-05-07T20:26:01.3088747Z 2025-05-07T20:26:01.3088904Z  2025-05-07T20:26:01.3089090Z 2025-05-07T20:26:01.3089095Z 2025-05-07T20:26:01.3089101Z 2025-05-07T20:26:01.3089113Z 2025-05-07T20:26:01.3089127Z 2025-05-07T20:26:01.3089136Z 2025-05-07T20:26:01.3089141Z 2025-05-07T20:26:01.3089146Z 2025-05-07T20:26:01.3089305Z  2025-05-07T20:26:01.3089507Z 2025-05-07T20:26:01.3089513Z 2025-05-07T20:26:01.3089518Z 2025-05-07T20:26:01.3089530Z 2025-05-07T20:26:01.3089535Z 2025-05-07T20:26:01.3089540Z 2025-05-07T20:26:01.3089546Z 2025-05-07T20:26:01.3089551Z 2025-05-07T20:26:01.3089556Z 2025-05-07T20:26:01.3089720Z  
2025-05-07T20:26:01.3089938Z 2025-05-07T20:26:01.3089950Z 2025-05-07T20:26:01.3089955Z 2025-05-07T20:26:01.3089960Z 2025-05-07T20:26:01.3089965Z 2025-05-07T20:26:01.3089970Z 2025-05-07T20:26:01.3089976Z 2025-05-07T20:26:01.3089981Z 2025-05-07T20:26:01.3089986Z 2025-05-07T20:26:01.3089991Z 2025-05-07T20:26:01.3090164Z  2025-05-07T20:26:01.3090397Z 2025-05-07T20:26:01.3090402Z 2025-05-07T20:26:01.3090408Z 2025-05-07T20:26:01.3090413Z 2025-05-07T20:26:01.3090418Z 2025-05-07T20:26:01.3090455Z 2025-05-07T20:26:01.3090465Z 2025-05-07T20:26:01.3090470Z 2025-05-07T20:26:01.3090475Z 2025-05-07T20:26:01.3090479Z 2025-05-07T20:26:01.3090484Z 2025-05-07T20:26:01.3090667Z  2025-05-07T20:26:01.3090906Z 2025-05-07T20:26:01.3090911Z 2025-05-07T20:26:01.3090916Z 2025-05-07T20:26:01.3090921Z 2025-05-07T20:26:01.3090926Z 2025-05-07T20:26:01.3090931Z 2025-05-07T20:26:01.3090936Z 2025-05-07T20:26:01.3090941Z 2025-05-07T20:26:01.3090946Z 2025-05-07T20:26:01.3090952Z 2025-05-07T20:26:01.3090957Z 2025-05-07T20:26:01.3090962Z 2025-05-07T20:26:01.3091144Z  2025-05-07T20:26:01.3091394Z 2025-05-07T20:26:01.3091399Z 2025-05-07T20:26:01.3091404Z 2025-05-07T20:26:01.3091410Z 2025-05-07T20:26:01.3091415Z 2025-05-07T20:26:01.3091420Z 2025-05-07T20:26:01.3091425Z 2025-05-07T20:26:01.3091437Z 2025-05-07T20:26:01.3091443Z 2025-05-07T20:26:01.3091448Z 2025-05-07T20:26:01.3091453Z 2025-05-07T20:26:01.3091458Z 2025-05-07T20:26:01.3091469Z 2025-05-07T20:26:01.3091751Z  2025-05-07T20:26:01.3092018Z 2025-05-07T20:26:01.3092023Z 2025-05-07T20:26:01.3092036Z 2025-05-07T20:26:01.3092041Z 2025-05-07T20:26:01.3092046Z 2025-05-07T20:26:01.3092051Z 2025-05-07T20:26:01.3092056Z 2025-05-07T20:26:01.3092061Z 2025-05-07T20:26:01.3092066Z 2025-05-07T20:26:01.3092071Z 2025-05-07T20:26:01.3092076Z 2025-05-07T20:26:01.3092081Z 2025-05-07T20:26:01.3092086Z 2025-05-07T20:26:01.3092091Z 2025-05-07T20:26:01.3092283Z  2025-05-07T20:26:01.3092559Z 2025-05-07T20:26:01.3092564Z 2025-05-07T20:26:01.3092570Z 2025-05-07T20:26:01.3092575Z 2025-05-07T20:26:01.3092580Z 2025-05-07T20:26:01.3092585Z 2025-05-07T20:26:01.3092590Z 2025-05-07T20:26:01.3092595Z 2025-05-07T20:26:01.3092600Z 2025-05-07T20:26:01.3092605Z 2025-05-07T20:26:01.3092610Z 2025-05-07T20:26:01.3092615Z 2025-05-07T20:26:01.3092621Z 2025-05-07T20:26:01.3092626Z 2025-05-07T20:26:01.3092631Z 2025-05-07T20:26:01.3092927Z  2025-05-07T20:26:01.3093203Z 2025-05-07T20:26:01.3093208Z 2025-05-07T20:26:01.3093213Z 2025-05-07T20:26:01.3093218Z 2025-05-07T20:26:01.3093223Z 2025-05-07T20:26:01.3093228Z 2025-05-07T20:26:01.3093233Z 2025-05-07T20:26:01.3093238Z 2025-05-07T20:26:01.3093243Z 2025-05-07T20:26:01.3093247Z 2025-05-07T20:26:01.3093252Z 2025-05-07T20:26:01.3093264Z 2025-05-07T20:26:01.3093269Z 2025-05-07T20:26:01.3093274Z 2025-05-07T20:26:01.3093279Z 2025-05-07T20:26:01.3093284Z 2025-05-07T20:26:01.3093490Z  2025-05-07T20:26:01.3093781Z 2025-05-07T20:26:01.3093786Z 2025-05-07T20:26:01.3093800Z 2025-05-07T20:26:01.3093805Z 2025-05-07T20:26:01.3093810Z 2025-05-07T20:26:01.3093815Z 2025-05-07T20:26:01.3093820Z 2025-05-07T20:26:01.3093824Z 2025-05-07T20:26:01.3093829Z 2025-05-07T20:26:01.3093834Z 2025-05-07T20:26:01.3093839Z 2025-05-07T20:26:01.3093844Z 2025-05-07T20:26:01.3093849Z 2025-05-07T20:26:01.3093862Z 2025-05-07T20:26:01.3093873Z 2025-05-07T20:26:01.3093878Z 2025-05-07T20:26:01.3093883Z 2025-05-07T20:26:01.3094099Z  2025-05-07T20:26:01.3094404Z 2025-05-07T20:26:01.3094410Z 2025-05-07T20:26:01.3094415Z 2025-05-07T20:26:01.3094421Z 2025-05-07T20:26:01.3094426Z 2025-05-07T20:26:01.3094456Z 
2025-05-07T20:26:01.3094460Z 2025-05-07T20:26:01.3094465Z 2025-05-07T20:26:01.3094470Z 2025-05-07T20:26:01.3094486Z 2025-05-07T20:26:01.3094492Z 2025-05-07T20:26:01.3094497Z 2025-05-07T20:26:01.3094502Z 2025-05-07T20:26:01.3094507Z 2025-05-07T20:26:01.3094512Z 2025-05-07T20:26:01.3094517Z 2025-05-07T20:26:01.3094522Z 2025-05-07T20:26:01.3094527Z 2025-05-07T20:26:01.3094750Z  2025-05-07T20:26:01.3095053Z 2025-05-07T20:26:01.3095058Z 2025-05-07T20:26:01.3095191Z  2025-05-07T20:26:01.3095327Z 2025-05-07T20:26:01.3095332Z 2025-05-07T20:26:01.3095481Z  2025-05-07T20:26:01.3095630Z 2025-05-07T20:26:01.3095638Z 2025-05-07T20:26:01.3095642Z 2025-05-07T20:26:01.3095752Z  2025-05-07T20:26:01.3095986Z 2025-05-07T20:26:01.3095990Z 2025-05-07T20:26:01.3095994Z 2025-05-07T20:26:01.3095997Z 2025-05-07T20:26:01.3096108Z  2025-05-07T20:26:01.3096241Z 2025-05-07T20:26:01.3096245Z 2025-05-07T20:26:01.3096248Z 2025-05-07T20:26:01.3096252Z 2025-05-07T20:26:01.3096256Z 2025-05-07T20:26:01.3096365Z  2025-05-07T20:26:01.3096493Z 2025-05-07T20:26:01.3096496Z 2025-05-07T20:26:01.3096500Z 2025-05-07T20:26:01.3096503Z 2025-05-07T20:26:01.3096507Z 2025-05-07T20:26:01.3096510Z 2025-05-07T20:26:01.3096625Z  2025-05-07T20:26:01.3096781Z 2025-05-07T20:26:01.3096787Z 2025-05-07T20:26:01.3096791Z 2025-05-07T20:26:01.3096796Z 2025-05-07T20:26:01.3096801Z 2025-05-07T20:26:01.3096806Z 2025-05-07T20:26:01.3096811Z 2025-05-07T20:26:01.3096977Z  2025-05-07T20:26:01.3097175Z 2025-05-07T20:26:01.3097181Z 2025-05-07T20:26:01.3097194Z 2025-05-07T20:26:01.3097325Z 2025-05-07T20:26:01.3097329Z 2025-05-07T20:26:01.3097333Z 2025-05-07T20:26:01.3097336Z 2025-05-07T20:26:01.3097340Z 2025-05-07T20:26:01.3097471Z  2025-05-07T20:26:01.3097629Z 2025-05-07T20:26:01.3097633Z 2025-05-07T20:26:01.3097636Z 2025-05-07T20:26:01.3097640Z 2025-05-07T20:26:01.3097643Z 2025-05-07T20:26:01.3097647Z 2025-05-07T20:26:01.3097651Z 2025-05-07T20:26:01.3097654Z 2025-05-07T20:26:01.3097658Z 2025-05-07T20:26:01.3097780Z  2025-05-07T20:26:01.3097942Z 2025-05-07T20:26:01.3097946Z 2025-05-07T20:26:01.3097949Z 2025-05-07T20:26:01.3097953Z 2025-05-07T20:26:01.3097957Z 2025-05-07T20:26:01.3097960Z 2025-05-07T20:26:01.3097964Z 2025-05-07T20:26:01.3097967Z 2025-05-07T20:26:01.3097971Z 2025-05-07T20:26:01.3097975Z 2025-05-07T20:26:01.3098100Z  2025-05-07T20:26:01.3098271Z 2025-05-07T20:26:01.3098275Z 2025-05-07T20:26:01.3098278Z 2025-05-07T20:26:01.3098360Z 2025-05-07T20:26:01.3098370Z 2025-05-07T20:26:01.3098374Z 2025-05-07T20:26:01.3098377Z 2025-05-07T20:26:01.3098381Z 2025-05-07T20:26:01.3098384Z 2025-05-07T20:26:01.3098388Z 2025-05-07T20:26:01.3098391Z 2025-05-07T20:26:01.3098523Z  2025-05-07T20:26:01.3098709Z 2025-05-07T20:26:01.3098712Z 2025-05-07T20:26:01.3098716Z 2025-05-07T20:26:01.3098719Z 2025-05-07T20:26:01.3098723Z 2025-05-07T20:26:01.3098726Z 2025-05-07T20:26:01.3098730Z 2025-05-07T20:26:01.3098734Z 2025-05-07T20:26:01.3098737Z 2025-05-07T20:26:01.3098741Z 2025-05-07T20:26:01.3098744Z 2025-05-07T20:26:01.3098748Z 2025-05-07T20:26:01.3098936Z  2025-05-07T20:26:01.3099150Z 2025-05-07T20:26:01.3099153Z 2025-05-07T20:26:01.3099157Z 2025-05-07T20:26:01.3099160Z 2025-05-07T20:26:01.3099164Z 2025-05-07T20:26:01.3099168Z 2025-05-07T20:26:01.3099171Z 2025-05-07T20:26:01.3099175Z 2025-05-07T20:26:01.3099178Z 2025-05-07T20:26:01.3099182Z 2025-05-07T20:26:01.3099192Z 2025-05-07T20:26:01.3099217Z 2025-05-07T20:26:01.3099220Z 2025-05-07T20:26:01.3099362Z  2025-05-07T20:26:01.3099558Z 2025-05-07T20:26:01.3099562Z 2025-05-07T20:26:01.3099565Z 2025-05-07T20:26:01.3099575Z 
2025-05-07T20:26:01.3099765Z done
2025-05-07T20:26:01.6308361Z Preparing transaction: done
2025-05-07T20:26:05.4833164Z Verifying transaction: done
2025-05-07T20:26:06.5016311Z Executing transaction: done
2025-05-07T20:26:08.8586362Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ...
2025-05-07T20:26:08.8586847Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:08.8587617Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:08.8602593Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:08.8614475Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:08.8619948Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:09.0325304Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
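The [INSTALL] steps above work around a layout change in the CUDA 12.8 conda packages: the log implies only the versioned libnvToolsExt.so.1 is shipped, and the nvtx3 headers sit under the nsight-compute tree rather than on the default include path. A minimal sketch of the same fix, assuming $CONDA_PREFIX points at the build_binary environment (the nsight-compute version glob is a hedge, not taken from this log):

    # Restore the unversioned soname link that older build scripts link against
    ln -sf "$CONDA_PREFIX/lib/libnvToolsExt.so.1" "$CONDA_PREFIX/lib/libnvToolsExt.so"
    # Surface the nvtx3 headers on the environment's default include path
    cp -r "$CONDA_PREFIX"/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3/* \
          "$CONDA_PREFIX/include/"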
2025-05-07T20:26:09.0347224Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:09.0721538Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:10.9472576Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:11.0095787Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:11.4309783Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:11.4658620Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:11.8985513Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:11.8986596Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:14.3451645Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:16.3591631Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:18.4000862Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:18.4001907Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:20.4408119Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:22.3364026Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:22.3994133Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:26.2509540Z /tmp/tmp56xggd1p: line 3: clang: command not found
2025-05-07T20:26:26.2512617Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:26.3144393Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:26.3166004Z total 36
2025-05-07T20:26:26.3166296Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:26:26.3166678Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 ..
2025-05-07T20:26:26.3167143Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:26.3167675Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:26.3168181Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:26.3168657Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:26.3169115Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:26.3169596Z -rw-r--r--. 2 ec2-user ec2-user 2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
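Two things worth noting in the block above. First, `conda env config vars set` stores a variable inside the environment itself, so it is exported on every activation, including the implicit activation performed by `conda run`; the earlier ERROR from `conda run printenv LD_LIBRARY_PATH` is the expected result of querying the variable before it was set. Second, the paths being wired up point at driver stubs (libcuda.so, libnvidia-ml.so), which allow linking against the driver API at build time without depending on the host driver's library paths. A minimal sketch of the persistence mechanism, with SOME_VAR as a hypothetical name:

    # Persist a variable in the env; it is exported on each (re)activation
    conda env config vars set -n build_binary SOME_VAR=/some/path
    # `conda run` re-activates the env, so the value is now visible
    conda run -n build_binary printenv SOME_VAR
    # Inspect everything stored this way
    conda env config vars list -n build_binary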
2025-05-07T20:26:26.3170343Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:26.3171019Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:26.3192356Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:28.2792207Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:28.2792789Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:28.7043126Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:30.5910112Z -allow-unsupported-compiler
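The two steps above reconfigure how nvcc picks its host compiler: the sed deletes the `-ccbin=` line that the ~cuda-nvcc_activate.sh hook would otherwise inject (pinning nvcc to the conda compiler), and NVCC_PREPEND_FLAGS is an environment variable that nvcc itself reads and inserts ahead of the flags of every invocation, so `-allow-unsupported-compiler` silences the host-compiler version check when clang is used instead. A minimal sketch of the effect, where kernel.cu and the explicit clang++ ccbin are hypothetical:

    # nvcc reads NVCC_PREPEND_FLAGS and behaves as if those flags were passed first
    export NVCC_PREPEND_FLAGS='-allow-unsupported-compiler'
    # Effectively runs: nvcc -allow-unsupported-compiler -ccbin=clang++ -c kernel.cu -o kernel.o
    nvcc -ccbin=clang++ -c kernel.cu -o kernel.o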
2025-05-07T20:26:30.6535342Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:30.6535883Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:26:32.6017424Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:26:32.6018044Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:26:32.6018383Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:26:32.6018707Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:26:32.6019031Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:26:32.6019466Z #define _STL_PAIR_H 1
2025-05-07T20:26:32.6019722Z #define __cpp_attributes 200809L
2025-05-07T20:26:32.6020115Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:26:32.6020597Z #define __DELETE_THROW throw()
2025-05-07T20:26:32.6020957Z #define _PTRDIFF_T_
2025-05-07T20:26:32.6021283Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:26:32.6021674Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:26:32.6021962Z #define _IO_LEFT 02
2025-05-07T20:26:32.6022182Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:26:32.6022456Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:26:32.6022729Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:26:32.6023166Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:26:32.6023695Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:26:32.6024038Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:26:32.6024297Z #define _IOS_OUTPUT 2
2025-05-07T20:26:32.6024533Z #define __SM_100_RT_HPP__
2025-05-07T20:26:32.6024855Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:26:32.6025330Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:26:32.6025753Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:26:32.6026121Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:26:32.6026471Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:26:32.6037973Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:26:32.6039211Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:26:32.6039618Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:26:32.6039929Z #define cudaTextureTypeCubemapLayered 0xFC
2025-05-07T20:26:32.6040247Z #define _T_WCHAR_
2025-05-07T20:26:32.6040466Z #define stdout stdout
2025-05-07T20:26:32.6040804Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
2025-05-07T20:26:32.6041201Z #define CHAR_BIT __CHAR_BIT__
2025-05-07T20:26:32.6041460Z #define __flexarr []
2025-05-07T20:26:32.6041697Z #define _GLIBCXX_HAVE_FINITEF 1
2025-05-07T20:26:32.6042064Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l))
2025-05-07T20:26:32.6042553Z #define _IO_FLAGS2_USER_WBUF 8
2025-05-07T20:26:32.6042898Z #define _MATH_H 1
2025-05-07T20:26:32.6043271Z #define cudaOccupancyDisableCachingOverride 0x01
2025-05-07T20:26:32.6043980Z #define __S64_TYPE long int
2025-05-07T20:26:32.6044316Z #define __stub_fchflags
2025-05-07T20:26:32.6044676Z #define cudaDeviceScheduleMask 0x07
2025-05-07T20:26:32.6045080Z #define __SQUAD_TYPE long int
2025-05-07T20:26:32.6045740Z #define __INTMAX_C(c) c ## L
2025-05-07T20:26:32.6046158Z #define cudaStreamFireAndForget ((cudaStream_t)0x4)
2025-05-07T20:26:32.6046625Z #define _BSD_SIZE_T_DEFINED_
2025-05-07T20:26:32.6046974Z #define NL_NMAX INT_MAX
2025-05-07T20:26:32.6047282Z #define _BITS_TIME_H 1
2025-05-07T20:26:32.6047662Z #define M_LN10l 2.302585092994045684017991454684364208L
2025-05-07T20:26:32.6048111Z #define _GLIBCXX_TXN_SAFE_DYN
2025-05-07T20:26:32.6048516Z #define cudaStreamTailLaunch ((cudaStream_t)0x3)
2025-05-07T20:26:32.6048975Z #define M_El 2.718281828459045235360287471352662498L
2025-05-07T20:26:32.6049387Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd)
2025-05-07T20:26:32.6049931Z #define __CHAR_BIT__ 8
2025-05-07T20:26:32.6050197Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE
2025-05-07T20:26:32.6050530Z #define _PSTL_STRING_CONCAT(x,y) x #y
2025-05-07T20:26:32.6050820Z #define _GLIBCXX98_USE_C99_MATH 1
2025-05-07T20:26:32.6051088Z #define FP_NAN 0
2025-05-07T20:26:32.6051347Z #define makedev(maj,min) gnu_dev_makedev (maj, min)
2025-05-07T20:26:32.6051767Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2
2025-05-07T20:26:32.6052155Z #define __cudaCDP2GetErrorString
2025-05-07T20:26:32.6052442Z #define SHRT_MAX __SHRT_MAX__
2025-05-07T20:26:32.6052704Z #define _GLIBCXX_X86_RDSEED 1
2025-05-07T20:26:32.6052950Z #define __SM_80_RT_H__
2025-05-07T20:26:32.6053172Z #define _NEW
2025-05-07T20:26:32.6053393Z #define CLOCK_PROCESS_CPUTIME_ID 2
2025-05-07T20:26:32.6053667Z #define __UINT8_MAX__ 0xff
2025-05-07T20:26:32.6054049Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition)
2025-05-07T20:26:32.6054476Z #define __SCHAR_WIDTH__ 8
2025-05-07T20:26:32.6054724Z #define __USE_ANSI 1
2025-05-07T20:26:32.6055020Z #define _IO_BE(expr,res) __builtin_expect ((expr), res)
2025-05-07T20:26:32.6055434Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l))
2025-05-07T20:26:32.6055806Z #define __cudaCDP2Memcpy2DAsync_ptsz
2025-05-07T20:26:32.6056106Z #define __WINT_MAX__ 0xffffffffU
2025-05-07T20:26:32.6056389Z #define __SIZEOF_PTHREAD_ATTR_T 56
2025-05-07T20:26:32.6056717Z #define __FLT32_MIN_EXP__ (-125)
2025-05-07T20:26:32.6057008Z #define _GLIBCXX_END_NAMESPACE_LDBL
2025-05-07T20:26:32.6057301Z #define PIPE_BUF 4096
2025-05-07T20:26:32.6057627Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2)
2025-05-07T20:26:32.6058099Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11
2025-05-07T20:26:32.6058492Z #define ADJ_TICK 0x4000
2025-05-07T20:26:32.6058779Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10)
2025-05-07T20:26:32.6059104Z #define MQ_PRIO_MAX 32768
2025-05-07T20:26:32.6059373Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4
2025-05-07T20:26:32.6059700Z #define __WAIT_INT(status) (*(int *) &(status))
2025-05-07T20:26:32.6060298Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min))
2025-05-07T20:26:32.6060836Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01
2025-05-07T20:26:32.6061207Z #define _XOPEN_SOURCE 700
2025-05-07T20:26:32.6061461Z #define _POSIX2_BC_DIM_MAX 2048
2025-05-07T20:26:32.6061728Z #define __VECTOR_FUNCTIONS_HPP__
2025-05-07T20:26:32.6062010Z #define __cpp_static_assert 201411L
2025-05-07T20:26:32.6062293Z #define __GLIBCXX__ 20230528
2025-05-07T20:26:32.6062549Z #define _GLIBCXX_HAVE_STRXFRM_L 1
2025-05-07T20:26:32.6062824Z #define _POSIX_TTY_NAME_MAX 9
2025-05-07T20:26:32.6063106Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__
2025-05-07T20:26:32.6063402Z #define __OFF_T_MATCHES_OFF64_T 1
2025-05-07T20:26:32.6063691Z #define __ORDER_LITTLE_ENDIAN__ 1234
2025-05-07T20:26:32.6063997Z #define __SIZE_MAX__ 0xffffffffffffffffUL
2025-05-07T20:26:32.6064358Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l))
2025-05-07T20:26:32.6064720Z #define __WCHAR_MAX__ 0x7fffffff
2025-05-07T20:26:32.6065095Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1
2025-05-07T20:26:32.6065411Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE
2025-05-07T20:26:32.6065772Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l))
2025-05-07T20:26:32.6066132Z #define cudaNvSciSyncAttrSignal 0x1
2025-05-07T20:26:32.6066429Z #define _GLIBCXX_USE_LONG_LONG 1
2025-05-07T20:26:32.6066716Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1
2025-05-07T20:26:32.6067045Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1
2025-05-07T20:26:32.6067374Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1
2025-05-07T20:26:32.6067785Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L)
2025-05-07T20:26:32.6068210Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1
2025-05-07T20:26:32.6068518Z #define ADJ_ESTERROR 0x0008
2025-05-07T20:26:32.6068789Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2
2025-05-07T20:26:32.6069066Z #define __GCC_IEC_559 2
2025-05-07T20:26:32.6069358Z #define __cpp_lib_transformation_trait_aliases 201304
2025-05-07T20:26:32.6069878Z #define _IO_flockfile(_fp)
2025-05-07T20:26:32.6070147Z #define CLOCK_MONOTONIC_RAW 4
2025-05-07T20:26:32.6070415Z #define __FLT32X_DECIMAL_DIG__ 17
2025-05-07T20:26:32.6070683Z #define _IOFBF 0
2025-05-07T20:26:32.6070891Z #define __USE_BSD 1
2025-05-07T20:26:32.6071114Z #define __FLT_EVAL_METHOD__ 0
2025-05-07T20:26:32.6071394Z #define SHRT_MIN (-SHRT_MAX - 1)
2025-05-07T20:26:32.6071665Z #define _IO_USER_LOCK 0x8000
2025-05-07T20:26:32.6071921Z #define _IO_NO_WRITES 8
2025-05-07T20:26:32.6072183Z #define _GLIBCXX_PSEUDO_VISIBILITY(V)
2025-05-07T20:26:32.6072544Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname
2025-05-07T20:26:32.6072909Z #define _GLIBCXX_HAVE_SYS_STAT_H 1
2025-05-07T20:26:32.6073225Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ())
2025-05-07T20:26:32.6073558Z #define __cpp_binary_literals 201304L
2025-05-07T20:26:32.6073854Z #define _CPP_TYPE_TRAITS_H 1
2025-05-07T20:26:32.6074128Z #define __BEGIN_NAMESPACE_C99
2025-05-07T20:26:32.6074425Z #define __FLT64_DECIMAL_DIG__ 17
2025-05-07T20:26:32.6074734Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A)
2025-05-07T20:26:32.6075130Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE)
2025-05-07T20:26:32.6075501Z #define __cpp_noexcept_function_type 201510L
2025-05-07T20:26:32.6075803Z #define M_PI 3.14159265358979323846
2025-05-07T20:26:32.6076115Z #define _GLIBCXX_PACKAGE_NAME "package-unused"
2025-05-07T20:26:32.6076443Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1
2025-05-07T20:26:32.6076744Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2
2025-05-07T20:26:32.6077051Z
#define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:32.6077327Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:32.6077593Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:32.6078436Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:32.6079053Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:32.6079484Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:32.6079808Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:32.6080120Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:32.6080400Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:32.6080667Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:32.6080983Z #define __ASSERT_VOID_CAST static_cast<void> 2025-05-07T20:26:32.6081321Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:32.6081618Z #define RAND_MAX 2147483647 2025-05-07T20:26:32.6081888Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:32.6082221Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6082545Z #define __SM_90_RT_H__ 2025-05-07T20:26:32.6083916Z nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:26:32.6084671Z 2025-05-07T20:26:32.6084770Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:32.6085336Z #define __COMPAR_FN_T 2025-05-07T20:26:32.6085573Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6085835Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:32.6086326Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:32.6086855Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:32.6087198Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:32.6087573Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:32.6087881Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:32.6088223Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:32.6088547Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:32.6089079Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:32.6089650Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:32.6089987Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:32.6090276Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:32.6090595Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:32.6090905Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:32.6091181Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:32.6091458Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:32.6091722Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:32.6091976Z #define __u_char_defined 2025-05-07T20:26:32.6092327Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:32.6092701Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:32.6092960Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:32.6093222Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:32.6093511Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:32.6093963Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:32.6094407Z #define FP_INFINITE 1 2025-05-07T20:26:32.6094787Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:32.6095319Z #define _IO_pid_t __pid_t 2025-05-07T20:26:32.6095685Z #define __UINT_FAST8_MAX__ 0xff
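The wait-status macros scattered through this dump (WIFEXITED just above; WEXITSTATUS, __WTERMSIG, __WIFSIGNALED, and __W_STOPCODE appear further down) all carve a single int into bit fields: bits 0-6 carry the terminating signal, bit 7 the core-dump flag, and bits 8-15 the exit code. A minimal standalone C sketch of that layout follows; the status values are illustrative, not taken from this job:

    #include <stdio.h>

    /* Mirrors the glibc layout seen in the dump: __WTERMSIG is
     * (status & 0x7f), __WEXITSTATUS is ((status & 0xff00) >> 8),
     * and bit 7 is the core-dump flag. */
    static void decode_wait_status(int status) {
        int termsig = status & 0x7f;
        int exitcode = (status & 0xff00) >> 8;
        if (termsig == 0)
            printf("exited with code %d\n", exitcode);
        else
            printf("killed by signal %d%s\n", termsig,
                   (status & 0x80) ? " (core dumped)" : "");
    }

    int main(void) {
        decode_wait_status(0x0100); /* exit(1) -> exited with code 1 */
        decode_wait_status(0x0009); /* SIGKILL -> killed by signal 9 */
        return 0;
    }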
2025-05-07T20:26:32.6096041Z #define __LEAF , __leaf__ 2025-05-07T20:26:32.6096356Z #define PATH_MAX 4096 2025-05-07T20:26:32.6096689Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:32.6097152Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:32.6097501Z #define _LIMITS_H___ 2025-05-07T20:26:32.6097722Z #define __size_t 2025-05-07T20:26:32.6098012Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:32.6098746Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:32.6099578Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:32.6099953Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:32.6100294Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:32.6100554Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:32.6100921Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:32.6101552Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:32.6101856Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:32.6102178Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:32.6102462Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:32.6102743Z #define __INT8_C(c) c 2025-05-07T20:26:32.6102993Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:32.6103292Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:32.6103556Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:32.6103808Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:32.6104061Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:32.6104342Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:32.6104962Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6105289Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:32.6105561Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:32.6105830Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:32.6106089Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:32.6106411Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:32.6106851Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:32.6107214Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:32.6107602Z #define NFDBITS __NFDBITS 2025-05-07T20:26:32.6107861Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:32.6108274Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:32.6108601Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:32.6108922Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:32.6109173Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:32.6109464Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:32.6109932Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:32.6110247Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:32.6110679Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:32.6111048Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:32.6111339Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:32.6111665Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:32.6111996Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:32.6112321Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:32.6112657Z #define __daddr_t_defined 2025-05-07T20:26:32.6112909Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:32.6113184Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:32.6113498Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:32.6114036Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 
20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:32.6114551Z #define _ACRTIMP 2025-05-07T20:26:32.6114772Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:32.6115046Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:32.6115345Z #define _IOS_BIN 128 2025-05-07T20:26:32.6115704Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:32.6116129Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6116418Z #define UNDERFLOW 4 2025-05-07T20:26:32.6116637Z #define NAME_MAX 255 2025-05-07T20:26:32.6116871Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:32.6117146Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:32.6117430Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:32.6117721Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:32.6118111Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:32.6118516Z #define __ptr_t void * 2025-05-07T20:26:32.6118758Z #define M_E 2.7182818284590452354 2025-05-07T20:26:32.6119041Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:32.6119311Z #define __USE_ISOCXX11 1 2025-05-07T20:26:32.6119581Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:32.6119901Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:32.6120202Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:32.6120486Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:32.6120771Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:32.6121248Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:32.6121512Z #define __linux 1 2025-05-07T20:26:32.6121727Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:32.6121998Z #define cudaDeviceMask 0xff 2025-05-07T20:26:32.6122264Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:32.6122549Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:32.6122832Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:32.6123133Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:32.6123441Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:32.6123753Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:32.6124054Z #define _BITS_TYPES_H 1 2025-05-07T20:26:32.6124349Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:32.6124694Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:32.6125002Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:32.6125289Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:32.6125576Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:32.6125965Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:32.6126838Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:32.6127691Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:32.6127980Z #define __unix 1 2025-05-07T20:26:32.6128192Z #define MATH_ERRNO 1 2025-05-07T20:26:32.6128424Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:32.6128704Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:32.6128968Z #define __SM_100_RT_H__ 2025-05-07T20:26:32.6129215Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:32.6129496Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:32.6129784Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6130058Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:32.6130355Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:32.6130839Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 
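Two of the macros above reduce a dotted version to one integer so that a single comparison orders releases: __CUDART_API_VERSION packs major * 1000 + minor * 10, and __GLIBC_PREREQ (near the start of this dump) packs (major << 16) + minor. A small sketch with assumed values (a CUDA minor of 8, matching the 12.8.0 toolkit in this job's name; glibc 2.26 is purely illustrative):

    #include <stdio.h>

    int main(void) {
        /* __CUDART_API_VERSION: major * 1000 + minor * 10; the minor
         * value 8 is assumed from the 12.8.0 toolkit, not read here. */
        int cuda_api = 12 * 1000 + 8 * 10;
        printf("CUDART_API_VERSION = %d\n", cuda_api); /* 12080 */

        /* __GLIBC_PREREQ packs (major << 16) + minor, so ordering
         * versions is one integer compare. 2.26 vs. a required 2.17
         * is an illustrative pair, not taken from this build. */
        int have = (2 << 16) + 26;
        int need = (2 << 16) + 17;
        printf("glibc >= 2.17: %d\n", have >= need); /* prints 1 */
        return 0;
    }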
2025-05-07T20:26:32.6131328Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:32.6131623Z #define CUDARTAPI_CDECL 2025-05-07T20:26:32.6131882Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:32.6132157Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:32.6132445Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:32.6132716Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:32.6132956Z #define __SIZE_T 2025-05-07T20:26:32.6133208Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:32.6133530Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:32.6133834Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:32.6134102Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:32.6134367Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:32.6134634Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:32.6135036Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:32.6135481Z #define __WAIT_STATUS void * 2025-05-07T20:26:32.6135759Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:32.6136029Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:32.6136295Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:32.6136585Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:32.6136868Z #define __WINT_MIN__ 0U 2025-05-07T20:26:32.6137474Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:32.6138363Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:32.6138673Z #define WUNTRACED 2 2025-05-07T20:26:32.6138903Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:32.6139171Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:32.6139459Z #define NZERO 20 2025-05-07T20:26:32.6139684Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:32.6139976Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:32.6140373Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:32.6140728Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:32.6141097Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:32.6141392Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:32.6141749Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:32.6142025Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:32.6142300Z #define EXIT_FAILURE 1 2025-05-07T20:26:32.6142538Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:32.6142798Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:32.6143060Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:32.6143315Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:32.6143596Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:32.6143930Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:32.6144293Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:32.6144591Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:32.6144840Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:32.6145121Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:32.6145422Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:32.6145738Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:32.6147729Z #define SEEK_DATA 3 2025-05-07T20:26:32.6147958Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:32.6148247Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:32.6148682Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:32.6149082Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:32.6149333Z #define __INT64_C(c) c ## L 2025-05-07T20:26:32.6149596Z #define __NTH(fct) __LEAF_ATTR fct throw () 
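Several limits in this dump are derived rather than spelled out: SCHAR_MIN above is (-SCHAR_MAX - 1) because two's complement has one more negative value than positive, and ULONG_LONG_MAX (earlier in the dump) is (LONG_LONG_MAX * 2ULL + 1ULL), all bits set, computed in unsigned arithmetic so nothing overflows. A short sketch verifying both identities:

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        /* Two's complement: MIN has one more magnitude step than MAX. */
        printf("SCHAR_MAX = %d, derived MIN = %d\n",
               SCHAR_MAX, -SCHAR_MAX - 1); /* 127, -128 */

        /* Unsigned max is all bits set: twice the signed max plus one.
         * The 2ULL operand forces unsigned arithmetic, so the multiply
         * cannot overflow. */
        unsigned long long ull_max = LLONG_MAX * 2ULL + 1ULL;
        printf("ULONG_LONG_MAX = %llu\n", ull_max);
        return 0;
    }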
2025-05-07T20:26:32.6150065Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:32.6150398Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:32.6150667Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:32.6150970Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:32.6151282Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:32.6151541Z #define __INT_WCHAR_T_H 2025-05-07T20:26:32.6151791Z #define WSTOPPED 2 2025-05-07T20:26:32.6152029Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:32.6152330Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:32.6152588Z #define FP_NORMAL 4 2025-05-07T20:26:32.6152832Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:32.6153116Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:32.6153360Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:32.6153628Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:32.6153924Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:32.6154199Z #define cudaTextureType1D 0x01 2025-05-07T20:26:32.6154474Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:32.6154743Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:32.6155015Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:32.6155321Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:32.6155767Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:32.6156253Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:32.6156523Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:32.6156791Z #define _POSIX_SOURCE 1 2025-05-07T20:26:32.6157051Z #define cudaTextureType2D 0x02 2025-05-07T20:26:32.6157317Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:32.6157592Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:32.6157917Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:32.6158186Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:32.6158516Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:32.6158865Z #define cudaTextureType3D 0x03 2025-05-07T20:26:32.6159136Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:32.6159402Z #define CLOCK_REALTIME 0 2025-05-07T20:26:32.6159655Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:32.6159929Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:32.6160244Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:32.6160529Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:32.6160806Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:32.6161099Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:32.6161377Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:32.6161772Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:32.6162075Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:32.6162354Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:32.6174707Z #define __GLIBC__ 2 2025-05-07T20:26:32.6174997Z #define __END_DECLS } 2025-05-07T20:26:32.6175248Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:32.6175632Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:32.6176030Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:32.6176281Z #define WCONTINUED 8 2025-05-07T20:26:32.6176505Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:32.6176753Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:32.6177020Z #define _ALLOCA_H 1 2025-05-07T20:26:32.6177238Z #define __host__ __location__(host) 2025-05-07T20:26:32.6177666Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:32.6178113Z #define __SLONG32_TYPE int 2025-05-07T20:26:32.6178370Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 
1 2025-05-07T20:26:32.6178902Z #define _SYS_SELECT_H 1 2025-05-07T20:26:32.6179135Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:32.6179378Z #define _IOS_NOCREATE 32 2025-05-07T20:26:32.6179623Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:32.6179895Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:32.6180186Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:32.6180465Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:32.6180751Z #define __global__ __location__(global) 2025-05-07T20:26:32.6181035Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:32.6181282Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:32.6181559Z #define __DBL_DIG__ 15 2025-05-07T20:26:32.6181778Z #define TIME_UTC 1 2025-05-07T20:26:32.6181995Z #define __FLT32_DIG__ 6 2025-05-07T20:26:32.6182321Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:32.6182729Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:32.6183426Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:32.6183756Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:32.6184069Z #define _G_BUFSIZ 8192 2025-05-07T20:26:32.6184381Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:32.6184755Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:32.6185060Z #define __cudaCDP2GetDevice 2025-05-07T20:26:32.6185347Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:32.6185638Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:32.6185888Z #define __GXX_WEAK__ 1 2025-05-07T20:26:32.6186146Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6186455Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:32.6186712Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:32.6187001Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:32.6187338Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:32.6187614Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:32.6187906Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:32.6188203Z #define _G_config_h 1 2025-05-07T20:26:32.6188485Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:32.6188834Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:32.6189107Z #define _GCC_WCHAR_T 2025-05-07T20:26:32.6189343Z #define TMP_MAX 238328 2025-05-07T20:26:32.6189585Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:32.6189971Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:32.6190236Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:32.6190518Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:32.6190793Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:32.6191090Z #define _IO_SKIPWS 01 2025-05-07T20:26:32.6191509Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:32.6191993Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:32.6192256Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:32.6192599Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:32.6192982Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:32.6193611Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:32.6193993Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:32.6194247Z #define le32toh(x) (x) 2025-05-07T20:26:32.6194473Z #define _SIZE_T_DEFINED 2025-05-07T20:26:32.6194723Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:32.6195061Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:32.6195409Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:32.6195821Z #define __WIFSIGNALED(status) 
(((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:32.6196251Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:32.6196513Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:32.6196768Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:32.6197029Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:32.6197304Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:32.6197842Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:32.6198365Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:32.6198814Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:32.6199422Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:32.6199749Z #define _WCHAR_T_ 2025-05-07T20:26:32.6199974Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:32.6200353Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:32.6200747Z #define RTSIG_MAX 32 2025-05-07T20:26:32.6200971Z #define _STDDEF_H 2025-05-07T20:26:32.6201204Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:32.6201468Z #define _VA_LIST_DEFINED 2025-05-07T20:26:32.6201719Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:32.6202055Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:32.6202445Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:32.6202781Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:32.6203074Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:32.6203553Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:32.6204108Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:32.6204493Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:32.6204821Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:32.6205142Z #define __unix__ 1 2025-05-07T20:26:32.6205379Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:32.6205667Z #define __INT_WIDTH__ 32 2025-05-07T20:26:32.6205909Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:32.6206153Z #define _IONBF 2 2025-05-07T20:26:32.6206615Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:32.6207422Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:32.6207989Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:32.6208247Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:32.6208524Z #define __UINT16_C(c) c 2025-05-07T20:26:32.6208773Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:32.6209051Z #define STA_DEL 0x0020 2025-05-07T20:26:32.6209297Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:26:32.6209553Z #define __id_t_defined 2025-05-07T20:26:32.6209828Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:32.6210302Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:32.6210747Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:32.6211023Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:32.6211290Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:32.6211547Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:32.6211819Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:32.6212090Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:32.6212356Z #define SING 2 2025-05-07T20:26:32.6212575Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:32.6212848Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6213159Z #define cudaStreamDefault 0x00 2025-05-07T20:26:32.6213633Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:32.6214022Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:32.6214295Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:32.6214558Z #define __gnu_linux__ 1 2025-05-07T20:26:32.6214795Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:32.6215051Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:32.6215352Z #define MAX_INPUT 255 2025-05-07T20:26:32.6215601Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:32.6215934Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:32.6216314Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:32.6216674Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:32.6216951Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:32.6217361Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:32.6217797Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:32.6218131Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:32.6218606Z #define _Mfloat_ float 2025-05-07T20:26:32.6218863Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:32.6219177Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:32.6219472Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:32.6219790Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:26:32.6220351Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:32.6220874Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6221161Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:32.6221491Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:32.6221861Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:32.6222169Z #define __USE_ISOC11 1 2025-05-07T20:26:32.6222396Z #define _BSD_SIZE_T_ 2025-05-07T20:26:32.6222632Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:32.6222887Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:32.6223162Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:32.6223467Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:32.6223798Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:32.6224107Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:32.6224452Z #define __THROW throw () 2025-05-07T20:26:32.6224712Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:32.6225007Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6225371Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:32.6225740Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:32.6226024Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:32.6226555Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:32.6226824Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:32.6227086Z #define L_tmpnam 20 2025-05-07T20:26:32.6227301Z #define ___int_wchar_t_h 2025-05-07T20:26:32.6227648Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:32.6228046Z #define isascii(c) __isascii (c) 2025-05-07T20:26:32.6228313Z #define _T_PTRDIFF 2025-05-07T20:26:32.6228630Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:32.6229002Z #define toascii(c) __toascii (c) 2025-05-07T20:26:32.6229254Z #define __GNUC__ 11 2025-05-07T20:26:32.6229504Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:32.6230081Z #define __GXX_RTTI 1 2025-05-07T20:26:32.6230303Z #define __pie__ 2 2025-05-07T20:26:32.6230517Z #define __MMX__ 1 2025-05-07T20:26:32.6230738Z #define __cudaCDP2Malloc 2025-05-07T20:26:32.6230997Z #define __timespec_defined 1 2025-05-07T20:26:32.6231242Z #define L_ctermid 9 2025-05-07T20:26:32.6231474Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:32.6231783Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:32.6232176Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:32.6232565Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:32.6232838Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:32.6233231Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:32.6233550Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:32.6233876Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:32.6234142Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:32.6234611Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:32.6235411Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:32.6236059Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:32.6236371Z #define __USE_SVID 1 2025-05-07T20:26:32.6236628Z #define __constant__ __location__(constant) 2025-05-07T20:26:32.6236969Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:32.6237282Z #define __device__ __location__(device) 2025-05-07T20:26:32.6237614Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:32.6237953Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:32.6238239Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:32.6238600Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:32.6238955Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:32.6239333Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:32.6239615Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:32.6239999Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:32.6240403Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:32.6240654Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:32.6241038Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:32.6241489Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:32.6241821Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:32.6242093Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:32.6242457Z #define NGROUPS_MAX 65536 2025-05-07T20:26:32.6242789Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:32.6243061Z #define __USE_ISOC95 1 2025-05-07T20:26:32.6243290Z #define _TIME_H 1 2025-05-07T20:26:32.6243651Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:32.6243978Z #define __USE_ISOC99 1 2025-05-07T20:26:32.6244308Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:32.6244687Z #define HOST_NAME_MAX 64 2025-05-07T20:26:32.6244934Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:32.6245197Z #define _IOS_ATEND 4 2025-05-07T20:26:32.6245432Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:32.6245753Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:32.6246164Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:32.6246514Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:32.6246800Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:32.6247117Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:32.6247438Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:32.6247697Z #define _STDIO_H 1 2025-05-07T20:26:32.6248109Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:32.6248603Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:32.6248974Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:32.6249354Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:32.6249650Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:32.6249926Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:32.6250196Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:32.6250486Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:32.6250791Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6251113Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:32.6251382Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:32.6251666Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:32.6251978Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:32.6252247Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:32.6252540Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:32.6253099Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:32.6253487Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:32.6253735Z #define __USE_XOPEN 1 2025-05-07T20:26:32.6253980Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:32.6254431Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:32.6254891Z #define __USE_XOPEN2K 1 2025-05-07T20:26:32.6255134Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:32.6255402Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:32.6255699Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:32.6255972Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:32.6256518Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:32.6257064Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:32.6257355Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:32.6257722Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:32.6258246Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:32.6258632Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:32.6259039Z #define __END_NAMESPACE_C99 2025-05-07T20:26:32.6259312Z #define __glibcxx_integral_traps true 2025-05-07T20:26:32.6259595Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:32.6260130Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:32.6260390Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:32.6260650Z #define _IOS_TRUNC 16 2025-05-07T20:26:32.6260879Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:32.6261131Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:32.6261419Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:32.6261721Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:32.6262098Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:32.6262488Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:32.6262768Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:32.6263032Z #define _IO_UNITBUF 020000 2025-05-07T20:26:32.6263293Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:32.6263556Z #define __FD_SETSIZE 1024 2025-05-07T20:26:32.6263811Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:32.6264088Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:32.6264432Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:32.6264798Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:32.6265068Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:32.6265376Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:32.6265703Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:32.6265978Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:32.6266282Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:32.6266634Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:32.6266928Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:32.6267263Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:32.6267563Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:32.6267845Z #define __USE_POSIX199506 1 2025-05-07T20:26:32.6268111Z #define _FEATURES_H 1 2025-05-07T20:26:32.6268352Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:32.6268766Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:32.6269270Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:32.6269616Z #define 
__stub_getmsg 2025-05-07T20:26:32.6270008Z #define _IO_FIXED 010000 2025-05-07T20:26:32.6270288Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:26:32.6270606Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:32.6270884Z #define __stub_setlogin 2025-05-07T20:26:32.6271123Z #define __stub_fattach 2025-05-07T20:26:32.6271359Z #define __cplusplus 201703L 2025-05-07T20:26:32.6271631Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:32.6271920Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:32.6272173Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:32.6272457Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:32.6273073Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:32.6273639Z #define _IO_INTERNAL 010 2025-05-07T20:26:32.6273882Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:32.6274221Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:32.6274590Z #define __dev_t_defined 2025-05-07T20:26:32.6274820Z #define __DEPRECATED 1 2025-05-07T20:26:32.6275048Z #define __S32_TYPE int 2025-05-07T20:26:32.6275295Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:32.6275588Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:32.6275848Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:32.6276102Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:32.6276742Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:32.6277415Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:32.6277732Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:32.6278085Z #define OVERFLOW 3 2025-05-07T20:26:32.6278420Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:32.6278735Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:32.6279023Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:32.6279359Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:32.6279697Z #define __SSE2_MATH__ 1 2025-05-07T20:26:32.6279949Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:32.6280255Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6280561Z #define _IO_STDIO_H 2025-05-07T20:26:32.6280808Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:32.6281096Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:32.6281424Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:32.6281724Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6282036Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:32.6282305Z #define __amd64 1 2025-05-07T20:26:32.6282529Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:32.6283049Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:32.6283385Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:32.6283680Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:32.6283994Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:32.6284256Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:32.6284560Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:32.6284830Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:32.6285078Z #define __bounded 2025-05-07T20:26:32.6285302Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:32.6285578Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6285866Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:32.6286392Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:32.6286667Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:32.6286941Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:32.6287271Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:26:32.6287698Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:32.6288114Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:32.6288394Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:32.6288740Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:32.6289099Z #define STA_PLL 0x0001 2025-05-07T20:26:32.6289470Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:32.6289742Z #define __GNUG__ 11 2025-05-07T20:26:32.6289973Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:32.6290230Z #define _T_WCHAR 2025-05-07T20:26:32.6290466Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:32.6290756Z #define __specialization_static 2025-05-07T20:26:32.6291056Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:32.6291372Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:32.6291635Z #define cudaArraySparse 0x40 2025-05-07T20:26:32.6291893Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:32.6292176Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:32.6292484Z #define _WCHAR_T 2025-05-07T20:26:32.6292707Z #define __cudaCDP2Free 2025-05-07T20:26:32.6293620Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:32.6294371Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:32.6294806Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:32.6295266Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:32.6295550Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:32.6295814Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:32.6296153Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:32.6296506Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:32.6296752Z #define __NO_CTYPE 1 2025-05-07T20:26:32.6296979Z #define __stub_bdflush 2025-05-07T20:26:32.6297361Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:32.6297801Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:32.6298113Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:32.6298508Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:32.6298785Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:32.6299096Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:32.6299410Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:32.6307942Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:32.6308320Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:32.6308614Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:32.6308912Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:32.6309273Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:32.6309639Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:32.6310133Z #define _IO_STDIO 040000 2025-05-07T20:26:32.6310472Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:32.6310877Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:32.6311195Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:32.6311505Z #define _PTRDIFF_T 2025-05-07T20:26:32.6311735Z #define _MOVE_H 1 2025-05-07T20:26:32.6311964Z #define __cpp_hex_float 201603L 2025-05-07T20:26:32.6312238Z #define ADJ_TAI 0x0080 2025-05-07T20:26:32.6312467Z #define __ptrvalue 2025-05-07T20:26:32.6312701Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:32.6312966Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:26:32.6313259Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:32.6313564Z #define MATH_ERREXCEPT 2 2025-05-07T20:26:32.6313826Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:32.6314115Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:32.6314524Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:32.6314933Z #define __USE_GNU 1 2025-05-07T20:26:32.6315174Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:32.6315454Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:32.6315732Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:32.6316141Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:32.6316549Z #define WEXITED 4 2025-05-07T20:26:32.6316768Z #define _IO_NO_READS 4 2025-05-07T20:26:32.6317071Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:32.6317430Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:32.6317715Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:32.6318026Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:32.6318352Z #define __uid_t_defined 2025-05-07T20:26:32.6318886Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:32.6319185Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:32.6319458Z #define WNOHANG 1 2025-05-07T20:26:32.6319706Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:32.6320021Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:32.6320295Z #define cudaEventDefault 0x00 2025-05-07T20:26:32.6320594Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:32.6320922Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:32.6321165Z #define __x86_64 1 2025-05-07T20:26:32.6321566Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:32.6321975Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:32.6322478Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:32.6322992Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:32.6323445Z #define __PTRDIFF_T 2025-05-07T20:26:32.6323771Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:32.6324154Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:32.6324431Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:32.6324728Z #define _Mlong_double_ long double 2025-05-07T20:26:32.6325015Z #define __cpp_lambdas 200907L 2025-05-07T20:26:32.6325262Z #define _IO_DEC 020 2025-05-07T20:26:32.6325487Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:32.6325759Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:32.6326048Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:32.6326434Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:32.6326697Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:32.6326991Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:32.6327321Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:32.6327598Z #define _ANSI_STDDEF_H 2025-05-07T20:26:32.6327872Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:32.6328191Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:32.6328574Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:32.6328978Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:32.6329258Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:32.6329555Z #define __cpp_template_auto 201606L 2025-05-07T20:26:32.6329923Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:32.6330302Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:32.6330576Z #define __key_t_defined 2025-05-07T20:26:32.6330827Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:32.6331210Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:32.6331701Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:32.6332089Z #define __GNUC_VA_LIST 2025-05-07T20:26:32.6332426Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:32.6332830Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:32.6333102Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:32.6333385Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:32.6333685Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:32.6333940Z #define __WCOREFLAG 0x80 2025-05-07T20:26:32.6334197Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:32.6334508Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:32.6334796Z #define __LP64__ 1 2025-05-07T20:26:32.6335044Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:32.6335365Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:32.6335659Z #define _IO_off64_t __off64_t 2025-05-07T20:26:32.6335937Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6336194Z #define __time_t_defined 1 2025-05-07T20:26:32.6336456Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:32.6336867Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:32.6337263Z #define __USE_UNIX98 1 2025-05-07T20:26:32.6337506Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6337782Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:32.6338056Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:32.6338353Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:32.6338672Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:32.6338936Z #define SEEK_CUR 1 2025-05-07T20:26:32.6339162Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:32.6339440Z #define _ASSERT_H 1 2025-05-07T20:26:32.6340042Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:32.6340858Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:32.6341138Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:32.6341399Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:32.6341665Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:32.6341938Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:32.6342324Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:32.6342751Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:32.6343438Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:32.6344142Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:32.6344545Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:32.6344963Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:32.6345401Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:32.6345717Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:32.6346124Z #define cudaArrayDefault 0x00 2025-05-07T20:26:32.6346708Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:32.6347017Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:32.6347310Z #define TLOSS 5 2025-05-07T20:26:32.6347525Z #define __ssize_t_defined 2025-05-07T20:26:32.6347787Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:26:32.6348069Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:32.6348369Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:32.6348655Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:32.6349087Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:32.6349375Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:32.6349819Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:32.6350124Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:32.6350418Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:32.6350714Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:32.6350976Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:32.6351323Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:32.6351701Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:32.6351946Z #define __cdecl 2025-05-07T20:26:32.6352188Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:32.6352522Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:32.6352865Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:32.6353124Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:32.6353394Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:32.6353696Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:32.6353970Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:32.6354281Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:32.6354622Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:32.6355042Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:32.6355496Z #define ADJ_NANO 0x2000 2025-05-07T20:26:32.6355801Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:32.6356184Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:32.6356475Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:32.6356734Z #define __FLT_DIG__ 6 2025-05-07T20:26:32.6357095Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:32.6357510Z #define __NO_INLINE__ 1 2025-05-07T20:26:32.6357810Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:32.6358175Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:32.6358436Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:32.6358697Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:32.6358994Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:32.6359268Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:32.6359573Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:32.6359862Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:32.6360256Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:32.6360689Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:32.6361157Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:32.6361523Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:32.6361766Z #define MAX_CANON 255 2025-05-07T20:26:32.6361995Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:32.6362254Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:32.6362523Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:32.6362809Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:32.6363120Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:32.6363428Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:32.6363705Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:32.6364028Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:32.6364351Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:32.6364618Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:32.6364909Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:32.6365206Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:32.6365495Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:32.6365899Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:32.6366206Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:32.6366468Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:32.6366719Z #define _SYS_TYPES_H 1 2025-05-07T20:26:32.6366961Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:32.6367229Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:32.6367478Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:32.6367719Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:32.6367996Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:32.6368291Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:32.6368545Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:32.6368866Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:32.6369139Z #define FP_SUBNORMAL 3 2025-05-07T20:26:32.6369383Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:32.6369666Z #define _INITIALIZER_LIST 2025-05-07T20:26:32.6369917Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:32.6370171Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:32.6370478Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:32.6370743Z #define _IO_file_flags _flags 2025-05-07T20:26:32.6371004Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:32.6371249Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:32.6371531Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:32.6371813Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:32.6372075Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:32.6372473Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:32.6372886Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:32.6373195Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:32.6373470Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:32.6373730Z #define _BSD_SOURCE 1 2025-05-07T20:26:32.6373958Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:32.6374867Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:32.6375795Z #define __catch(X) catch(X) 2025-05-07T20:26:32.6376059Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:32.6376348Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:32.6376624Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:32.6376878Z #define __STRING(x) #x 2025-05-07T20:26:32.6377112Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:32.6377393Z #define _T_PTRDIFF_ 2025-05-07T20:26:32.6377636Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:32.6377938Z 
#define cudaEventWaitExternal 0x01 2025-05-07T20:26:32.6378216Z #define __unbounded 2025-05-07T20:26:32.6378459Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:32.6378750Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:32.6379034Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6379344Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:32.6379626Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:32.6380017Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:32.6380355Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:32.6380874Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:32.6381152Z #define __managed__ __location__(managed) 2025-05-07T20:26:32.6381457Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:32.6381869Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:32.6382308Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:32.6382571Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:32.6383238Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:32.6383662Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:32.6383918Z #define _SYS_SIZE_T_H 2025-05-07T20:26:32.6384216Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:32.6384568Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:32.6384846Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:32.6385152Z #define _CRTIMP 2025-05-07T20:26:32.6385647Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:32.6385950Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:32.6386297Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:32.6386663Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:32.6387082Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:32.6387411Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:32.6387700Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:32.6387990Z #define __SIZE_T__ 2025-05-07T20:26:32.6388208Z #define __stub_gtty 2025-05-07T20:26:32.6388442Z #define __pid_t_defined 2025-05-07T20:26:32.6388711Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:32.6389006Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6389329Z #define __glibcxx_function_requires(...) 
2025-05-07T20:26:32.6389632Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:32.6389971Z #define __need_clockid_t 2025-05-07T20:26:32.6390233Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:32.6390494Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:32.6390809Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:32.6391134Z #define _IO_HEX 0100 2025-05-07T20:26:32.6391393Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:32.6391730Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:32.6391834Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:32.6391935Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:32.6392162Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:32.6392284Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:32.6392389Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:32.6392495Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:32.6392598Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:32.6392698Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:32.6392785Z #define __stub_sstk 2025-05-07T20:26:32.6392876Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:32.6393042Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:32.6393130Z #define __wur 2025-05-07T20:26:32.6393247Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:32.6393334Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:32.6393422Z #define _IO_OCT 040 2025-05-07T20:26:32.6393514Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:32.6393601Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:32.6393697Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:32.6393823Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:32.6393921Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:32.6394023Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:32.6394215Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:32.6394315Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:32.6394404Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:32.6394513Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:32.6394606Z #define __off64_t_defined 2025-05-07T20:26:32.6394849Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:32.6394938Z #define __FLT128_DIG__ 33 2025-05-07T20:26:32.6395046Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:32.6395143Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:32.6395233Z #define __INT32_C(c) c 2025-05-07T20:26:32.6395327Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:32.6395424Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:32.6395523Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:32.6395615Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:32.6395701Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:32.6395803Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:32.6395934Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:32.6396028Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:32.6396122Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:32.6396218Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:32.6396312Z #define __have_pthread_attr_t 1 2025-05-07T20:26:32.6396423Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:32.6396732Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:32.6396844Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:32.6396944Z #define __cudaCDP2EventRecord 2025-05-07T20:26:32.6397037Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:32.6397128Z #define 
htole32(x) (x) 2025-05-07T20:26:32.6397381Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:32.6397501Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:32.6397607Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:32.6397764Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:32.6397902Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:32.6398036Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:32.6398175Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:32.6398273Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:32.6398381Z #define cudaArrayLayered 0x01 2025-05-07T20:26:32.6398558Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:32.6398672Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:32.6398766Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:32.6398866Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:32.6398951Z #define unix 1 2025-05-07T20:26:32.6399046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:32.6399137Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:32.6399236Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:32.6399352Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:32.6399437Z #define __USE_POSIX 1 2025-05-07T20:26:32.6399540Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:32.6399670Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:32.6399766Z #define __THROWNL throw () 2025-05-07T20:26:32.6399857Z #define __cpp_rtti 199711L 2025-05-07T20:26:32.6399958Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:32.6400052Z #define __PMT(args) args 2025-05-07T20:26:32.6400175Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6400322Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:32.6400441Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:32.6400532Z #define _SIZE_T_DECLARED 2025-05-07T20:26:32.6400629Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:32.6400729Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:32.6401150Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:32.6401259Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:32.6401352Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:32.6401449Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:32.6401596Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:32.6401679Z #define _WCHAR_T_H 2025-05-07T20:26:32.6401769Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:32.6401863Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:32.6401949Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:32.6402184Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:32.6402289Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:32.6402380Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:32.6402485Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:32.6402571Z #define __ELF__ 1 2025-05-07T20:26:32.6402671Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:32.6402778Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:32.6402863Z #define STA_INS 0x0010 2025-05-07T20:26:32.6402961Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:32.6403141Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:32.6403232Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:32.6403328Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:32.6403445Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:32.6403551Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6403646Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:32.6403754Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:32.6403937Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:32.6404102Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:32.6404260Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:32.6404359Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:32.6404706Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:32.6404838Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:32.6404929Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:32.6405021Z #define __FLT_RADIX__ 2 2025-05-07T20:26:32.6405122Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:32.6405293Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:32.6405392Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:32.6405488Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:32.6405597Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:32.6405695Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:32.6405801Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:32.6405909Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:32.6405991Z #define WORD_BIT 32 2025-05-07T20:26:32.6406075Z #define _IO_USER_BUF 1 2025-05-07T20:26:32.6406174Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:32.6406275Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6406382Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:32.6406489Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:32.6406589Z #define __long_double_t long double 2025-05-07T20:26:32.6406681Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:32.6406778Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:32.6407201Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:32.6407290Z #define __k8 1 2025-05-07T20:26:32.6407491Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:32.6407896Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:32.6408037Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:32.6408137Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:32.6408235Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:32.6408339Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:32.6408433Z #define __blksize_t_defined 2025-05-07T20:26:32.6408527Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:32.6408630Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:32.6408741Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:32.6408843Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:32.6408949Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:32.6409043Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:32.6409146Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:32.6409415Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:32.6409779Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:32.6409984Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:32.6410084Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:32.6410166Z #define SEEK_SET 0 2025-05-07T20:26:32.6410271Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:32.6410365Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:26:32.6410573Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:32.6410674Z #define __cudaCDP2GetLastError 2025-05-07T20:26:32.6410769Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:32.6410869Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:32.6411210Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:32.6411309Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:32.6411565Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:32.6411655Z #define __stub_sigreturn 2025-05-07T20:26:32.6411911Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:32.6412100Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:32.6412188Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:32.6412296Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:32.6412382Z #define CLOCK_TAI 11 2025-05-07T20:26:32.6412489Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:32.6412709Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:32.6412798Z #define __restrict_arr 2025-05-07T20:26:32.6412908Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:32.6413056Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:32.6413626Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:32.6413821Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:32.6413911Z #define __USE_MISC 1 2025-05-07T20:26:32.6414028Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:32.6414134Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:32.6414224Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:32.6414311Z #define __LDBL_DIG__ 18 2025-05-07T20:26:32.6414414Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:32.6414512Z #define __malloc_and_calloc_defined 2025-05-07T20:26:32.6414605Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:32.6414715Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:32.6414796Z #define __x86_64__ 1 2025-05-07T20:26:32.6414877Z #define _SIZE_T_ 2025-05-07T20:26:32.6415879Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:32.6415987Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:32.6416096Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:32.6416210Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:32.6416326Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:32.6416426Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:32.6416534Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:32.6416660Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:32.6416799Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:32.6416896Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:32.6417400Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:26:32.6417522Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:32.6417668Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:32.6417774Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:32.6417983Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:32.6418069Z #define STA_FLL 0x0008 2025-05-07T20:26:32.6418218Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:32.6418313Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:32.6418439Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6418548Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:32.6418633Z #define __stub_revoke 2025-05-07T20:26:32.6418731Z #define __timer_t_defined 1 2025-05-07T20:26:32.6418863Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:32.6418952Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:32.6419066Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:32.6419171Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:32.6419266Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:32.6419374Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:32.6419484Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:32.6419593Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:32.6419819Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:32.6419913Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:32.6420007Z #define _IO_off_t __off_t 2025-05-07T20:26:32.6420093Z #define __FLT64_DIG__ 15 2025-05-07T20:26:32.6420321Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:32.6420424Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:32.6420551Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6420672Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:32.6420773Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:32.6420877Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:32.6420960Z #define NULL __null 2025-05-07T20:26:32.6421095Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:32.6421197Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:32.6421301Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:32.6421394Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6421497Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:32.6421589Z #define FP_ZERO 2 2025-05-07T20:26:32.6421684Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:32.6421838Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:32.6421951Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6422031Z #define __WCHAR_T__ 2025-05-07T20:26:32.6422126Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:32.6422358Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:32.6422511Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:32.6422616Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:32.6422734Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:32.6436321Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:32.6436565Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:32.6436734Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:32.6436889Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:32.6437022Z #define _SIGSET_H_types 1 2025-05-07T20:26:32.6437184Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:32.6437335Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:32.6437533Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:32.6437635Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:32.6437763Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:32.6437894Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:32.6438009Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:32.6438139Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:32.6438248Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:26:32.6438430Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:32.6438523Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:32.6438626Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:32.6438728Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:32.6439003Z #define STA_MODE 0x4000 2025-05-07T20:26:32.6439114Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:32.6439218Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:32.6439333Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:32.6439432Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:32.6439532Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:32.6439643Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:32.6439749Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:32.6439860Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:32.6439946Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:32.6440399Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:32.6440483Z #define __SEG_FS 1 2025-05-07T20:26:32.6440572Z #define _IO_size_t size_t 2025-05-07T20:26:32.6440676Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:32.6440773Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:32.6440856Z #define __stub_lchmod 2025-05-07T20:26:32.6441054Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:32.6441162Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6441257Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:32.6441347Z #define __SEG_GS 1 2025-05-07T20:26:32.6441536Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:32.6441630Z #define _IOS_APPEND 8 2025-05-07T20:26:32.6441722Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:32.6441811Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:32.6441913Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:32.6442010Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:32.6442109Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:32.6442198Z #define htole16(x) (x) 2025-05-07T20:26:32.6442304Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:32.6442395Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:32.6442494Z #define __INT16_TYPE__ short int 2025-05-07T20:26:32.6442592Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:32.6442716Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:32.6442825Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:32.6442953Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:32.6443049Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:32.6443138Z #define __WCLONE 0x80000000 2025-05-07T20:26:32.6443230Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:32.6443317Z #define SEEK_HOLE 4 2025-05-07T20:26:32.6443402Z #define TIMER_ABSTIME 1 2025-05-07T20:26:32.6443495Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:32.6443588Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:32.6443767Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:32.6443878Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6444114Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:32.6444225Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:32.6444327Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6444447Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:32.6444545Z #define _LINUX_LIMITS_H 2025-05-07T20:26:32.6444631Z #define linux 1 2025-05-07T20:26:32.6444721Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:32.6444829Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:32.6444936Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:32.6445026Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:32.6445131Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:32.6445281Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:32.6445377Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:32.6445470Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6445571Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:32.6445658Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:32.6445747Z #define htole64(x) (x) 2025-05-07T20:26:32.6445846Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:32.6445978Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:32.6446115Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:32.6446827Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:32.6446923Z #define __USE_POSIX2 1 2025-05-07T20:26:32.6447031Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:32.6447116Z #define __WALL 0x40000000 2025-05-07T20:26:32.6447212Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:32.6447309Z #define _XLOCALE_H 1 2025-05-07T20:26:32.6447442Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:32.6447579Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:32.6447675Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:32.6447779Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:32.6447873Z #define __EXCEPTIONS 1 2025-05-07T20:26:32.6447974Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:32.6448174Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:32.6448267Z #define __WORDSIZE 64 2025-05-07T20:26:32.6448359Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:32.6448448Z #define _STL_RELOPS_H 1 2025-05-07T20:26:32.6448677Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:32.6448776Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:32.6448876Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:32.6448974Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:32.6449072Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:32.6449390Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:32.6449628Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:32.6449765Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:32.6449871Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:32.6449971Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:32.6450081Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:32.6450190Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:32.6450299Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:32.6450483Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:32.6450595Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:32.6450687Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:32.6450796Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:32.6450976Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:32.6451091Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:32.6451181Z #define _STRING_H 1 2025-05-07T20:26:32.6451280Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:32.6451370Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:32.6451473Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:32.6451606Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:32.6451701Z #define __code_model_small__ 1 2025-05-07T20:26:32.6451793Z #define _PSTL_CONFIG_H 2025-05-07T20:26:32.6451893Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:32.6452009Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:32.6452104Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:32.6452211Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:32.6452574Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:32.6452668Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:32.6452754Z #define le64toh(x) (x) 2025-05-07T20:26:32.6452851Z #define FILENAME_MAX 4096 2025-05-07T20:26:32.6453004Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:32.6453119Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:32.6453211Z #define L_cuserid 9 2025-05-07T20:26:32.6453298Z #define __ino_t_defined 2025-05-07T20:26:32.6453379Z #define __k8__ 1 2025-05-07T20:26:32.6453482Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:32.6453591Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:32.6453683Z #define __int8_t_defined 2025-05-07T20:26:32.6453773Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:32.6453871Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:32.6453989Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:32.6454177Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:32.6454297Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:32.6454454Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:32.6454538Z #define __HAVE_COLUMN 2025-05-07T20:26:32.6454622Z #define __stub_fdetach 2025-05-07T20:26:32.6455064Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:32.6455144Z #define __pic__ 2 2025-05-07T20:26:32.6455272Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:32.6455368Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:32.6455462Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:32.6455570Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:32.6455654Z #define __stub_chflags 2025-05-07T20:26:32.6455743Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:32.6455836Z #define __need_IOV_MAX 2025-05-07T20:26:32.6456026Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:32.6456132Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:32.6456236Z #define __cpp_decltype 200707L 2025-05-07T20:26:32.6456332Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:32.6456423Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:32.6456535Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:32.6456621Z #define TTY_NAME_MAX 32 2025-05-07T20:26:32.6456796Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:32.6456917Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6457086Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:32.6457199Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:32.6457291Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:32.6457384Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:32.6457473Z #define __import__ 2025-05-07T20:26:32.6457563Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:32.6457703Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:32.6457799Z #define __export__ 2025-05-07T20:26:32.6457918Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:32.6458025Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:32.6458190Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:32.6458285Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:32.6458376Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:32.6458470Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:32.6458560Z #define _WCHAR_T_DECLARED 2025-05-07T20:26:32.6458687Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:32.6458804Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:32.6458908Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:32.6459004Z #define WNOWAIT 0x01000000 2025-05-07T20:26:32.6459087Z #define PLOSS 6 2025-05-07T20:26:32.6459181Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:32.6459471Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:32.6459568Z #define EXIT_SUCCESS 0 2025-05-07T20:26:32.6459671Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:32.6459766Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:32.6459868Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:32.6459963Z #define __thread__ __thread 2025-05-07T20:26:32.6460057Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:32.6460150Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:32.6460258Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:32.6460491Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:32.6460600Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:32.6460701Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:32.6460784Z #define __linux__ 1 2025-05-07T20:26:32.6460886Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:32.6461013Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:32.6461108Z #define __S16_TYPE short int 2025-05-07T20:26:32.6461578Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:32.6461688Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:32.6461882Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:32.6461988Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:32.6462086Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:32.6462167Z #define _T_SIZE_ 2025-05-07T20:26:32.6462270Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:32.6462388Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:32.6462488Z #define _PSTL_VERSION 12000 2025-05-07T20:26:32.6462608Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:32.6462702Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:32.6462804Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:32.6462933Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:32.6463018Z #define _IOS_INPUT 1 2025-05-07T20:26:32.6463114Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:32.6463302Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:32.6463394Z #define __INT64_TYPE__ long int 2025-05-07T20:26:32.6463499Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:32.6463598Z #define __shared__ __location__(shared) 2025-05-07T20:26:32.6463688Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:32.6463850Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:32.6463936Z #define __gid_t_defined 2025-05-07T20:26:32.6464053Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:32.6464148Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:32.6464350Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:32.6464450Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:32.6464540Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:32.6464627Z #define ___int_size_t_h 2025-05-07T20:26:32.6464739Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:32.6464860Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:26:32.6465026Z 
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:32.6465130Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:32.6465224Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:32.6465329Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:32.6465420Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:32.6465544Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6465661Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:32.6465778Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:32.6465870Z #define __clock_t_defined 1 2025-05-07T20:26:32.6465974Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:32.6466081Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:32.6466169Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:32.6466266Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:32.6466362Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:32.6466473Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:32.6466578Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:32.6466782Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:32.6466880Z #define __SSE__ 1 2025-05-07T20:26:32.6466976Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:32.6467068Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:32.6467157Z #define _CTYPE_H 1 2025-05-07T20:26:32.6467249Z #define __sigset_t_defined 2025-05-07T20:26:32.6467343Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:32.6467445Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:32.6467531Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:32.6467627Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:32.6467724Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:32.6467806Z #define __SM_70_RT_H__ 2025-05-07T20:26:32.6467904Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:32.6468006Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:32.6468101Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:32.6468354Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:32.6468454Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:32.6468564Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:32.6468662Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:32.6468753Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:32.6468835Z #define __amd64__ 1 2025-05-07T20:26:32.6468926Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:32.6469028Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:32.6469306Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:32.6469410Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:32.6469489Z #define EOF (-1) 2025-05-07T20:26:32.6469594Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:32.6469828Z #define __USE_POSIX199309 1 2025-05-07T20:26:32.6469927Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:32.6470024Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:32.6470116Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:32.6470211Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:32.6470424Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:32.6470516Z #define ____mbstate_t_defined 1 2025-05-07T20:26:32.6470601Z #define STA_NANO 0x2000 2025-05-07T20:26:32.6470701Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:32.6470792Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:32.6470877Z #define _IO_LINKED 0x80 2025-05-07T20:26:32.6470979Z #define __cpp_lib_launder 201606 2025-05-07T20:26:32.6471069Z #define __SIZEOF_INT128__ 16 2025-05-07T20:26:32.6471177Z 
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:32.6471268Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:32.6471362Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:32.6471510Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:32.6471614Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:32.6471713Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:32.6471822Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:32.6471914Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:32.6472013Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:32.6472149Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:32.6472269Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:32.6472480Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:32.6472671Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:32.6472755Z #define __stub_stty 2025-05-07T20:26:32.6472932Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:32.6473016Z #define le16toh(x) (x) 2025-05-07T20:26:32.6473122Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:32.6473309Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:32.6473390Z #define _SIZET_ 2025-05-07T20:26:32.6473480Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:32.6473566Z #define _SVID_SOURCE 1 2025-05-07T20:26:32.6473646Z #define _LP64 1 2025-05-07T20:26:32.6473734Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:32.6473994Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:32.6474104Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:32.6474193Z #define __UINT8_C(c) c 2025-05-07T20:26:32.6474285Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:32.6474376Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:32.6474488Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:32.6474580Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:32.6474673Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:32.6474773Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:32.6474855Z #define CUDARTAPI 2025-05-07T20:26:32.6474935Z #define IOV_MAX 1024 2025-05-07T20:26:32.6475082Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:32.6475176Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:32.6475281Z #define P_tmpdir "/tmp" 2025-05-07T20:26:32.6475382Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:32.6475463Z #define __wchar_t__ 2025-05-07T20:26:32.6475689Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:32.6475772Z #define SEEK_END 2 2025-05-07T20:26:32.6475861Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:32.6476041Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:32.6476137Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:32.6476281Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:32.6476375Z #define ____FILE_defined 1 2025-05-07T20:26:32.6476488Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:32.6476586Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:32.6476950Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:32.6477047Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:32.6477470Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:32.6477600Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:32.6477681Z #define _IO_RIGHT 04 2025-05-07T20:26:32.6477782Z #define __END_NAMESPACE_STD 2025-05-07T20:26:32.6478072Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:32.6478164Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:32.6478285Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:32.6478378Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:32.6478476Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:32.6478570Z #define _STDDEF_H_ 2025-05-07T20:26:32.6478748Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:32.6478849Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:32.6478966Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:32.6479170Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:32.6479283Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:32.6479422Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:32.6479540Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:32.6479645Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:32.6479762Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:32.6479857Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:32.6479977Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:32.6480071Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:32.6480165Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:32.6480260Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:32.6480439Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:32.6480535Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:32.6480718Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:32.6480815Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:32.6480913Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:32.6481056Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:32.6481150Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:32.6481250Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:32.6481347Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:32.6481475Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:32.6481572Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:32.6481671Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:32.6481861Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:32.6482036Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:32.6482136Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:32.6482264Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:32.6482373Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:32.6482473Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:32.6482722Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:32.6483132Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:32.6483285Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:32.6483389Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:32.6483478Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:32.6483819Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:32.6483918Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:32.6484012Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:32.6484096Z #define __FXSR__ 1 2025-05-07T20:26:32.6484174Z #define _SIZE_T 2025-05-07T20:26:32.6484277Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:32.6484395Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:32.6484568Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:32.6484721Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:32.6484821Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:32.6484917Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:32.6485112Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:32.6485318Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:32.6485407Z #define _GXX_NULLPTR_T 2025-05-07T20:26:32.6485537Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:32.6485758Z #define FOPEN_MAX 16 2025-05-07T20:26:32.6485846Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:32.6485973Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:32.6486069Z #define __suseconds_t_defined 2025-05-07T20:26:32.6486154Z #define __off_t_defined 2025-05-07T20:26:32.6486245Z #define stderr stderr 2025-05-07T20:26:32.6486341Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:32.6486452Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:32.6486562Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:32.6486656Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:32.6487099Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:32.6487188Z #define __mode_t_defined 2025-05-07T20:26:32.6487269Z #define _GCC_SIZE_T 2025-05-07T20:26:32.6487370Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:32.6487471Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:32.6487592Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:32.6487690Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:32.6487844Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:32.6487978Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:32.6488151Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:32.6488315Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:32.6488642Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:32.6488910Z #define __size_t__ 2025-05-07T20:26:32.6489073Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:32.6489201Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:32.6489394Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:32.6489562Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:32.6489804Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:32.6490008Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:32.6490122Z #define _ENDIAN_H 1 2025-05-07T20:26:32.6490303Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:32.6490450Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:32.6490567Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:32.6490775Z #define __try try 2025-05-07T20:26:32.6490901Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:32.6491026Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:32.6491203Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:32.6491506Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:32.6491692Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:32.6491819Z #define __PIC__ 2 2025-05-07T20:26:32.6491960Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:32.6492145Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:32.6492330Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:32.6492455Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:32.6492644Z #define 
2025-05-07T20:26:32.6493049Z [preprocessor #define dump: several thousand macro definitions from the host toolchain, glibc/libstdc++, and the CUDA headers, kept here only as a placeholder; notable entries include __NVCC__ 1, __CUDACC__ 1, CUDART_VERSION 12080, __GNUC_MINOR__ 4, and __STDC__ 1]
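A dump like the one above comes from asking the preprocessor to print every macro it predefines. A minimal sketch of the technique (the CI script's exact invocation is not shown in this log and may differ):

  # Dump all predefined macros of the host compiler, sorted for inspection
  gcc -dM -E - < /dev/null | sort
  # Filter for a single macro, e.g. the GNU C minor version seen in the dump
  gcc -dM -E - < /dev/null | grep __GNUC_MINOR__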
2025-05-07T20:26:32.6724852Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:34.5613766Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:34.5614431Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:26:34.5614920Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:26:34.5615369Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:26:34.5615785Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:26:34.6270313Z /usr/bin/nvidia-smi
2025-05-07T20:26:34.6273901Z + nvidia-smi
2025-05-07T20:26:34.6445858Z Wed May 7 20:26:34 2025
2025-05-07T20:26:34.6446615Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:34.6447237Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:26:34.6447827Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:34.6448482Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:34.6449177Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:26:34.6449727Z | | | MIG M. |
2025-05-07T20:26:34.6450294Z |=========================================+========================+======================|
2025-05-07T20:26:34.6613657Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:26:34.6615088Z | 0% 26C P8 19W / 300W | 0MiB / 23028MiB | 0% Default |
2025-05-07T20:26:34.6616290Z | | | N/A |
2025-05-07T20:26:34.6617158Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:34.6618931Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:34.6619487Z | Processes: |
2025-05-07T20:26:34.6619987Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:26:34.6620597Z | ID ID Usage |
2025-05-07T20:26:34.6621043Z |=========================================================================================|
2025-05-07T20:26:34.6634384Z | No running processes found |
2025-05-07T20:26:34.6635261Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:34.9402626Z [INSTALL] Successfully installed CUDA 12.8.0
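With ENFORCE_CUDA_DEVICE=1 set for this job, the toolkit check above must find a working GPU. The same assertion can be scripted outside CI; a minimal sketch (the query fields are illustrative):

  # Prints one CSV row per visible device; exits non-zero if the driver is unreachable
  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv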
2025-05-07T20:26:34.9452094Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:26:34.9452657Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:26:34.9464414Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:34.9464768Z env:
2025-05-07T20:26:34.9464984Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:34.9465280Z BUILD_ENV: build_binary
2025-05-07T20:26:34.9465523Z BUILD_TARGET: genai
2025-05-07T20:26:34.9465745Z BUILD_VARIANT: cuda
2025-05-07T20:26:34.9465968Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:26:34.9466221Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:34.9466520Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:34.9466854Z ##[endgroup]
2025-05-07T20:26:35.2825519Z ################################################################################
2025-05-07T20:26:35.2826054Z # Install PyTorch (PIP)
2025-05-07T20:26:35.2826373Z #
2025-05-07T20:26:35.2841088Z # [2025-05-07T20:26:35.283Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:26:35.2841745Z ################################################################################
2025-05-07T20:26:35.2871738Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:36.2827473Z Channels:
2025-05-07T20:26:36.2827780Z - conda-forge
2025-05-07T20:26:36.2828087Z Platform: linux-64
2025-05-07T20:26:39.5817421Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:40.3057357Z Solving environment: done
2025-05-07T20:26:40.5244282Z ## Package Plan ##
2025-05-07T20:26:40.5244642Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:40.5245117Z added / updated specs:
2025-05-07T20:26:40.5245473Z - numpy
2025-05-07T20:26:40.5245809Z The following packages will be downloaded:
2025-05-07T20:26:40.5246246Z     package                    |            build
2025-05-07T20:26:40.5246671Z     ---------------------------|-----------------
2025-05-07T20:26:40.5247072Z     libblas-3.9.0              | 31_h59b9bed_openblas      16 KB  conda-forge
2025-05-07T20:26:40.5247544Z     libcblas-3.9.0             | 31_he106b2a_openblas      16 KB  conda-forge
2025-05-07T20:26:40.5248011Z     libgfortran-15.1.0         | h69a702a_2                34 KB  conda-forge
2025-05-07T20:26:40.5248482Z     libgfortran5-15.1.0        | hcea5267_2               1.5 MB  conda-forge
2025-05-07T20:26:40.5248957Z     liblapack-3.9.0            | 31_h7ac8fdf_openblas      16 KB  conda-forge
2025-05-07T20:26:40.5249451Z     libopenblas-0.3.29         | pthreads_h94d23a6_0      5.6 MB  conda-forge
2025-05-07T20:26:40.5249919Z     numpy-2.0.2                | py39h9cb892a_1           7.6 MB  conda-forge
2025-05-07T20:26:40.5250314Z     ------------------------------------------------------------
2025-05-07T20:26:40.5250665Z                                            Total:        14.8 MB
2025-05-07T20:26:40.5251020Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:40.5251474Z   libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:40.5251992Z   libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:40.5252517Z   libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:40.5253045Z   libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:40.5253587Z   liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:40.5254151Z   libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:40.5254946Z   numpy         conda-forge/linux-64::numpy-2.0.2-py39h9cb892a_1
2025-05-07T20:26:40.5255574Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:40.6746410Z libblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:40.6892893Z libcblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:40.9079995Z libgfortran-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:26:40.9722395Z libgfortran5-15.1.0 | 1.5 MB | ########## | 100%
2025-05-07T20:26:40.9502878Z liblapack-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:41.0050144Z libopenblas-0.3.29 | 5.6 MB | ########## | 100%
2025-05-07T20:26:41.0490700Z numpy-2.0.2 | 7.6 MB | ########## | 100%
2025-05-07T20:26:41.4724265Z done
2025-05-07T20:26:41.5726661Z Preparing transaction: done
2025-05-07T20:26:41.7736651Z Verifying transaction: done
2025-05-07T20:26:41.8746775Z Executing transaction: done
2025-05-07T20:26:42.0540480Z ################################################################################
2025-05-07T20:26:42.0541014Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:42.0541378Z #
2025-05-07T20:26:42.0559038Z # [2025-05-07T20:26:42.055Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:26:42.0559694Z ################################################################################
2025-05-07T20:26:42.0576218Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:42.1471067Z [CHECK] Network does not appear to be blocked.
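The [EXEC] [ATTEMPT 0/3] prefix comes from a bounded-retry wrapper that the prelude script puts around network-dependent commands. A minimal bash sketch of the pattern (the real helper lives in .github/scripts/setup_env.bash; the function name and backoff below are assumptions):

  # Hypothetical retry wrapper mirroring the [EXEC] [ATTEMPT n/3] log lines
  exec_with_retries() {
    local max_attempts=3
    local i
    for ((i = 0; i <= max_attempts; i++)); do
      echo "[EXEC] [ATTEMPT ${i}/${max_attempts}] + $*"
      "$@" && return 0
      sleep 2  # assumed fixed backoff; the actual delay is not visible in the log
    done
    return 1
  }

  # Usage, matching the network probe above
  exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null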
2025-05-07T20:26:42.1471576Z ################################################################################ 2025-05-07T20:26:42.1472287Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:26:42.1472591Z # 2025-05-07T20:26:42.1489948Z # [2025-05-07T20:26:42.148Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:26:42.1490687Z ################################################################################ 2025-05-07T20:26:42.1490923Z 2025-05-07T20:26:42.1513429Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:26:42.1537627Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:26:42.1553700Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:26:42.1554255Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:26:42.1562719Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:26:42.1571484Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:26:42.1592982Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:18.4533908Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:18.4534563Z Collecting torch 2025-05-07T20:28:18.4535301Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp39-cp39-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:18.4536325Z Collecting filelock (from torch) 2025-05-07T20:28:18.4537029Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:18.4538206Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from torch) (4.13.2) 2025-05-07T20:28:18.4539013Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:18.4539535Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:18.4540693Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 34.5 MB/s eta 0:00:00 2025-05-07T20:28:18.4541174Z Collecting networkx (from torch) 2025-05-07T20:28:18.4541696Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.2.1-py3-none-any.whl (1.6 MB) 2025-05-07T20:28:18.4542388Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 16.3 MB/s eta 0:00:00 2025-05-07T20:28:18.4542740Z Collecting jinja2 (from torch) 2025-05-07T20:28:18.4543240Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:18.4543778Z Collecting fsspec (from torch) 2025-05-07T20:28:18.4544290Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:18.4544895Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 2025-05-07T20:28:18.4545779Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4546686Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:28:18.4547575Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4548477Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:28:18.4549352Z Downloading 
https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4550393Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:28:18.4551136Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:28:18.4551893Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:28:18.4552979Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4553752Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:28:18.4554591Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:18.4555647Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:28:18.4556407Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:18.4557182Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:28:18.4557950Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:18.4558727Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:28:18.4559599Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:18.4560463Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:18.4561318Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:28:18.4562081Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:18.4562911Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:18.4563734Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:28:18.4564561Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:18.4565412Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:28:18.4566282Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:18.4567164Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 2025-05-07T20:28:18.4568008Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:18.4568881Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:18.4569766Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:18.4571144Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 
2025-05-07T20:28:18.4572061Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:18.4572647Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:18.4573340Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 5.0 MB/s eta 0:00:00 2025-05-07T20:28:18.4573729Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:18.4574465Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB) 2025-05-07T20:28:18.4575587Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp39-cp39-manylinux_2_28_x86_64.whl (1047.1 MB) 2025-05-07T20:28:18.4576429Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 21.8 MB/s eta 0:00:00 2025-05-07T20:28:18.4577159Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:28:18.4577992Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 609.6/609.6 MB 51.0 MB/s eta 0:00:00 2025-05-07T20:28:18.4578934Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:28:18.4579851Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 169.1 MB/s eta 0:00:00 2025-05-07T20:28:18.4580757Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:28:18.4581668Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 171.0 MB/s eta 0:00:00 2025-05-07T20:28:18.4582520Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:28:18.4583745Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 72.6 MB/s eta 0:00:00 2025-05-07T20:28:18.4584468Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:28:18.4585298Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 698.0/698.0 MB 41.1 MB/s eta 0:00:00 2025-05-07T20:28:18.4586109Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:28:18.4587022Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 137.5 MB/s eta 0:00:00 2025-05-07T20:28:18.4587832Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:28:18.4588722Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 102.3 MB/s eta 0:00:00 2025-05-07T20:28:18.4589452Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB) 2025-05-07T20:28:18.4590353Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 146.7 MB/s eta 0:00:00 2025-05-07T20:28:18.4591103Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:28:18.4591926Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.4/260.4 MB 127.8 MB/s eta 0:00:00 2025-05-07T20:28:18.4592772Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 
2025-05-07T20:28:18.4593688Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 292.1/292.1 MB 109.1 MB/s eta 0:00:00
2025-05-07T20:28:18.4594421Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:28:18.4595247Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 136.5 MB/s eta 0:00:00
2025-05-07T20:28:18.4596046Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:28:18.4597095Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 130.5 MB/s eta 0:00:00
2025-05-07T20:28:18.4597921Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB)
2025-05-07T20:28:18.4598953Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.2/39.2 MB 161.8 MB/s eta 0:00:00
2025-05-07T20:28:18.4599742Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
2025-05-07T20:28:18.4600976Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB)
2025-05-07T20:28:18.4601894Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 129.3 MB/s eta 0:00:00
2025-05-07T20:28:18.4603812Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:28:18.4607764Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128
2025-05-07T20:28:20.6807990Z torch 2.8.0.dev20250507+cu128
2025-05-07T20:28:20.6809945Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128)
2025-05-07T20:28:24.1280971Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:27.5710726Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128
2025-05-07T20:28:27.5711349Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:30.9468982Z True
2025-05-07T20:28:30.9469228Z True
2025-05-07T20:28:31.0102249Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
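The variant and ABI checks above amount to querying the installed wheel from Python; a minimal sketch of an equivalent local verification (env name taken from BUILD_ENV):

  # Print version, CUDA variant, and C++11 ABI flag of the installed torch
  conda run -n build_binary python -c "import torch; print(torch.__version__, torch.version.cuda, torch._C._GLIBCXX_USE_CXX11_ABI)"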
2025-05-07T20:28:31.0153989Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:31.0154618Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:31.0169109Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:31.0169472Z env:
2025-05-07T20:28:31.0169695Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:31.0170003Z BUILD_ENV: build_binary
2025-05-07T20:28:31.0170245Z BUILD_TARGET: genai
2025-05-07T20:28:31.0170478Z BUILD_VARIANT: cuda
2025-05-07T20:28:31.0170710Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:28:31.0170964Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:31.0171285Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:31.0171631Z ##[endgroup]
2025-05-07T20:28:31.3519343Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:28:31.3521106Z ################################################################################
2025-05-07T20:28:31.3521609Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:28:31.3521996Z #
2025-05-07T20:28:31.3538426Z # [2025-05-07T20:28:31.353Z] + collect_pytorch_env_info build_binary
2025-05-07T20:28:31.3538830Z ################################################################################
2025-05-07T20:28:31.3554385Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:31.4592208Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:31.4603270Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:28:31.4603924Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:28:31.5501693Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:28:31.5524876Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:28:37.9066761Z Collecting environment information...
2025-05-07T20:28:37.9067177Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:28:37.9067489Z Is debug build: False 2025-05-07T20:28:37.9067739Z CUDA used to build PyTorch: 12.8 2025-05-07T20:28:37.9068028Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:37.9068202Z 2025-05-07T20:28:37.9068308Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:37.9068625Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:37.9068949Z Clang version: Could not collect 2025-05-07T20:28:37.9069225Z CMake version: Could not collect 2025-05-07T20:28:37.9069501Z Libc version: glibc-2.34 2025-05-07T20:28:37.9069674Z 2025-05-07T20:28:37.9070150Z Python version: 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:37.9070802Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:37.9071224Z Is CUDA available: True 2025-05-07T20:28:37.9071465Z CUDA runtime version: 12.8.61 2025-05-07T20:28:37.9071736Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:37.9072046Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:37.9072371Z Nvidia driver version: 570.133.07 2025-05-07T20:28:37.9072650Z cuDNN version: Could not collect 2025-05-07T20:28:37.9072918Z HIP runtime version: N/A 2025-05-07T20:28:37.9073159Z MIOpen runtime version: N/A 2025-05-07T20:28:37.9073466Z Is XNNPACK available: True 2025-05-07T20:28:37.9073634Z 2025-05-07T20:28:37.9073710Z CPU: 2025-05-07T20:28:37.9073932Z Architecture: x86_64 2025-05-07T20:28:37.9074263Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:37.9074672Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:37.9075073Z Byte Order: Little Endian 2025-05-07T20:28:37.9075743Z CPU(s): 16 2025-05-07T20:28:37.9076046Z On-line CPU(s) list: 0-15 2025-05-07T20:28:37.9076650Z Vendor ID: AuthenticAMD 2025-05-07T20:28:37.9077017Z Model name: AMD EPYC 7R32 2025-05-07T20:28:37.9077342Z CPU family: 23 2025-05-07T20:28:37.9077622Z Model: 49 2025-05-07T20:28:37.9077915Z Thread(s) per core: 2 2025-05-07T20:28:37.9078215Z Core(s) per socket: 8 2025-05-07T20:28:37.9078497Z Socket(s): 1 2025-05-07T20:28:37.9078776Z Stepping: 0 2025-05-07T20:28:37.9079091Z BogoMIPS: 5600.00 2025-05-07T20:28:37.9081359Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:37.9083909Z Hypervisor vendor: KVM 2025-05-07T20:28:37.9084220Z Virtualization type: full 2025-05-07T20:28:37.9084571Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:37.9084949Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:37.9085319Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:37.9085675Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:37.9086180Z NUMA node(s): 1 2025-05-07T20:28:37.9086480Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:37.9086813Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:37.9087201Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:37.9087570Z Vulnerability L1tf: Not affected 2025-05-07T20:28:37.9087916Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:37.9088277Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:37.9088638Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:37.9089008Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:37.9089570Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:37.9090181Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:37.9090743Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:37.9091466Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:37.9092376Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:37.9093092Z Vulnerability Srbds: Not affected 2025-05-07T20:28:37.9093461Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:37.9093700Z 2025-05-07T20:28:37.9093800Z Versions of relevant libraries: 2025-05-07T20:28:37.9094068Z [pip3] numpy==2.0.2 2025-05-07T20:28:37.9094309Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:37.9094613Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:37.9094924Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:37.9095238Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:37.9095555Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:37.9095847Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:37.9096140Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:37.9096442Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:37.9096742Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:37.9097198Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:37.9097500Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:37.9097783Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:37.9098085Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:37.9098372Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:37.9098672Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:37.9099059Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:37.9099568Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:37.9100107Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:37.9100647Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:37.9101211Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:37.9101775Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:37.9102275Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9102845Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:37.9103354Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:37.9103874Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:37.9104364Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9104855Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:37.9105394Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9105979Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9106472Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:37.9106975Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:37.9107456Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:28:37.9107942Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:37.9108426Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9108901Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:37.9109384Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9109986Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:37.9110486Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:37.9110988Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:37.9111491Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9111993Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:37.9112496Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:37.9113001Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:37.9113477Z [conda] numpy 2.0.2 py39h9cb892a_1 conda-forge 2025-05-07T20:28:37.9113958Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:37.9114478Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:37.9114993Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:37.9115522Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:37.9116032Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:37.9116621Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:37.9117138Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:37.9117668Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:37.9118178Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:37.9118692Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:37.9119205Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:37.9119707Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:37.9120209Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:37.9120704Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:37.9121182Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:37.9121466Z 2025-05-07T20:28:37.9853091Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:37.9853838Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:37.9866846Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:37.9867205Z env: 2025-05-07T20:28:37.9867452Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:37.9867779Z BUILD_ENV: build_binary 2025-05-07T20:28:37.9868023Z BUILD_TARGET: genai 2025-05-07T20:28:37.9868249Z BUILD_VARIANT: cuda 2025-05-07T20:28:37.9868474Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:37.9868730Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:37.9869030Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:37.9869546Z ##[endgroup] 2025-05-07T20:28:38.3231857Z ################################################################################ 2025-05-07T20:28:38.3232261Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:38.3232513Z # 2025-05-07T20:28:38.3247324Z # [2025-05-07T20:28:38.324Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:38.3247744Z ################################################################################ 2025-05-07T20:28:38.3247971Z 2025-05-07T20:28:38.3264117Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:38.4170393Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:38.4192656Z [BUILD] Running git submodules update ... 2025-05-07T20:28:38.4215334Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:38.4577511Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:38.4577996Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:38.4578470Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:38.4578877Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:38.4579287Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:38.4579943Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:38.4580467Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:38.4612917Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:38.5164335Z [BUILD] Installing other build dependencies ... 
2025-05-07T20:28:38.5185916Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:40.9561670Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:41.0232994Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:41.1271935Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:41.1308120Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:41.3941526Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:41.3977797Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:41.5117635Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:41.5152134Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:41.8879519Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:41.8918458Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:41.9509238Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:41.9512887Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:42.0390518Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:42.0428613Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:42.0915830Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 21)) (2.0.2) 2025-05-07T20:28:42.1491502Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:42.1527263Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:42.2776684Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:42.2808796Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:42.3986327Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:42.4016698Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:42.4672888Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:42.5355336Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:42.5539681Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:42.6492260Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:42.6553306Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:42.7847214Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:42.7881448Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:42.9099906Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:42.9145284Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:43.0158507Z Collecting pyproject_hooks (from build->-r requirements.txt (line 
14)) 2025-05-07T20:28:43.0196964Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:43.1719505Z Collecting importlib-metadata>=4.6 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:43.1773351Z Downloading importlib_metadata-8.7.0-py3-none-any.whl.metadata (4.8 kB) 2025-05-07T20:28:43.3012700Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:43.3047335Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:43.4257205Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:43.4314521Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:43.5521319Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:43.5559499Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:43.6597080Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:43.6639108Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:43.7243906Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:43.7740008Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:43.7776394Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:43.8288220Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:43.8821112Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:43.8854531Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:43.9317805Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:44.0152578Z Collecting zipp>=3.20 (from importlib-metadata>=4.6->build->-r requirements.txt (line 14)) 2025-05-07T20:28:44.0183241Z Downloading zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB) 2025-05-07T20:28:44.1294220Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:44.1322472Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:44.1869058Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:44.2436919Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:44.3037650Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:44.8319453Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 52.9 MB/s eta 0:00:00 2025-05-07T20:28:44.8351876Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:44.8995601Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:44.9777938Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:45.0527258Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:45.1091427Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:45.1605958Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 
(737 kB) 2025-05-07T20:28:45.2275772Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 737.4/737.4 kB 7.4 MB/s eta 0:00:00 2025-05-07T20:28:45.2468710Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:45.3100258Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:45.3680954Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:45.4200775Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:45.4841483Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:45.5469079Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:45.6061249Z Downloading importlib_metadata-8.7.0-py3-none-any.whl (27 kB) 2025-05-07T20:28:45.6670499Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:45.7274356Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:45.7858355Z Downloading zipp-3.21.0-py3-none-any.whl (9.6 kB) 2025-05-07T20:28:45.8475643Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:45.9062807Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:45.9670753Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:46.0252028Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:46.2745801Z Installing collected packages: sortedcontainers, zipp, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, importlib-metadata, hypothesis, pyre-extensions, build 2025-05-07T20:28:48.7203609Z 2025-05-07T20:28:48.7281830Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 importlib-metadata-8.7.0 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 zipp-3.21.0 2025-05-07T20:28:48.9203938Z ################################################################################ 2025-05-07T20:28:48.9204305Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:48.9204584Z # 2025-05-07T20:28:48.9221980Z # [2025-05-07T20:28:48.921Z] + install_triton_pip build_binary 2025-05-07T20:28:48.9222399Z ################################################################################ 2025-05-07T20:28:48.9222627Z 2025-05-07T20:28:48.9222871Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:48.9223566Z ################################################################################ 2025-05-07T20:28:48.9223949Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:48.9224279Z # 2025-05-07T20:28:48.9238452Z # [2025-05-07T20:28:48.923Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:48.9239002Z ################################################################################ 2025-05-07T20:28:48.9239233Z 2025-05-07T20:28:48.9254300Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:49.0147743Z [CHECK] Network does not appear to be blocked. 
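Despite the "Install PyTorch (PyTorch PIP)" banner above, the traced call is install_triton_pip: this step pins pytorch-triton to nightly/3.2.0+git4b3bb1f8, replacing the 3.3.0+git96316ce5 build that the torch wheel pulled in, which is why pip reports a dependency conflict just below. Such conflicts can be re-surfaced at any time; a minimal sketch (assuming the build_binary env):

  # List any requirements broken by the deliberate triton pin
  conda run -n build_binary pip check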
2025-05-07T20:28:49.0148470Z ################################################################################ 2025-05-07T20:28:49.0149562Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:49.0150092Z # 2025-05-07T20:28:49.0164854Z # [2025-05-07T20:28:49.016Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:49.0165375Z ################################################################################ 2025-05-07T20:28:49.0165603Z 2025-05-07T20:28:49.0211910Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:49.0228425Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:49.0229041Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:49.0237817Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:49.0247659Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:49.0269336Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:56.7623838Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:56.7625191Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:56.7625894Z 2025-05-07T20:28:56.7626116Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:56.7626554Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:56.7627416Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:56.7628732Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.4 MB) 2025-05-07T20:28:56.7630020Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.4/166.4 MB 54.5 MB/s eta 0:00:00 2025-05-07T20:28:56.7630422Z Installing collected packages: pytorch-triton 2025-05-07T20:28:56.7630788Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:56.7631186Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:56.7631629Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:56.7632066Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:56.7632519Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:56.7632796Z 2025-05-07T20:28:58.9808428Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:58.9812707Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:01.1232901Z ################################################################################ 2025-05-07T20:29:01.1233369Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:01.1233794Z ################################################################################ 2025-05-07T20:29:01.1234017Z 2025-05-07T20:29:03.1634046Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:05.3472799Z [CHECK] Python (sub-)package 'skbuild' found ... 
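In the step above, __prepare_pip_arguments expands the spec nightly/3.2.0+git4b3bb1f8 into --pre pytorch-triton==3.2.0+git4b3bb1f8 plus the channel index https://download.pytorch.org/whl/nightly/. A minimal bash sketch of that mapping, with variable names assumed (the real helper lives in setup_env.bash and only its effects are visible in this log):

# Hypothetical sketch: "<channel>/<version>" -> pip arguments, for the
# non-RELEASE channels exercised in this log.
package="pytorch-triton"
spec="nightly/3.2.0+git4b3bb1f8"
channel="${spec%%/*}"                                 # -> nightly
version="${spec#*/}"                                  # -> 3.2.0+git4b3bb1f8
index_url="https://download.pytorch.org/whl/${channel}/"
pre_flag="--pre"                                      # non-RELEASE channels install pre-releases
conda run -n build_binary pip install ${pre_flag} "${package}==${version}" --index-url "${index_url}"

Note the resolver warning printed during this step: the nightly torch build (2.8.0.dev20250507+cu128) declares pytorch-triton==3.3.0+git96316ce5, so pinning 3.2.0+git4b3bb1f8 leaves a declared conflict even though the install itself succeeds.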
2025-05-07T20:29:05.3475759Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:05.3510641Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:05.3511146Z . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:05.3525429Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:05.3525780Z env: 2025-05-07T20:29:05.3526000Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:05.3526305Z BUILD_ENV: build_binary 2025-05-07T20:29:05.3526541Z BUILD_TARGET: genai 2025-05-07T20:29:05.3526763Z BUILD_VARIANT: cuda 2025-05-07T20:29:05.3526992Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:05.3527437Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:05.3527736Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:05.3528073Z ##[endgroup] 2025-05-07T20:29:05.6867860Z ################################################################################ 2025-05-07T20:29:05.6868260Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:05.6868531Z # 2025-05-07T20:29:05.6884876Z # [2025-05-07T20:29:05.688Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.6885571Z ################################################################################ 2025-05-07T20:29:05.6885802Z 2025-05-07T20:29:05.6886181Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.6886915Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.6887274Z 2025-05-07T20:29:05.7048762Z 94d0750d60163e549c1eb2cb2a791ec2cf9a4d41 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7051188Z 2025-05-07T20:29:05.7051618Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7052006Z 2025-05-07T20:29:05.7236382Z 4ad1704987fa87cd63915598dc05a53ebebd35ab51336336eb8f0056001f042a fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7239298Z 2025-05-07T20:29:05.7239841Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7240202Z 2025-05-07T20:29:05.7576702Z 5c45ae153a493153a2b0776bec42bc74 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:05.7579707Z 2025-05-07T20:29:05.7589821Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:05.7611565Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:08.5948730Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:29:08.5949747Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.0.2) 2025-05-07T20:29:08.5950774Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:08.5951232Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:08.5951513Z 2025-05-07T20:29:15.5271305Z ################################################################################ 2025-05-07T20:29:15.5271683Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:15.5272060Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:15.5272499Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:29:15.5272819Z [CHECK] 2025-05-07T20:29:15.5273145Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:15.5273682Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:15.5274090Z ################################################################################ 2025-05-07T20:29:15.5274316Z 2025-05-07T20:29:15.5274429Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:19.4582958Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:23.3710647Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:27.3055782Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:27.3059805Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:39.0598521Z ################################################################################ 2025-05-07T20:29:39.0599027Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:39.0599507Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:39.0599971Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:39.0601880Z ################################################################################ 2025-05-07T20:29:39.0602264Z 2025-05-07T20:29:46.8937660Z ################################################################################ 2025-05-07T20:29:46.8938478Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:46.8941476Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:46.8944598Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:46.8945153Z ################################################################################ 2025-05-07T20:29:46.8945390Z 2025-05-07T20:29:46.8945544Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:50.8113273Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:54.7288429Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:58.7681888Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:02.6917090Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:02.6921671Z [INSTALL] Check for operator registrations ... 
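The operator checks just below resolve each operator through the torch.ops namespace; importing fbgemm_gpu loads the registering shared libraries, and the attribute lookup raises if an op was never registered. A minimal sketch of an equivalent per-operator check (the actual logic is in setup_env.bash and may differ):

# Hypothetical equivalent: printing a resolved op yields its qualified
# name (e.g. "fbgemm.nccl_init", as seen in the output below); a missing
# registration raises instead.
for op in nccl_init gqa_attn_splitk rope_qkv_decoding; do
  conda run -n build_binary python -c \
    "import torch, fbgemm_gpu; print(getattr(torch.ops.fbgemm, '${op}'))"
done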
2025-05-07T20:30:06.5370687Z fbgemm.nccl_init 2025-05-07T20:30:06.5370870Z 2025-05-07T20:30:06.6010944Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:10.4539934Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:10.4540216Z 2025-05-07T20:30:10.5182361Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:14.3672268Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:14.3672508Z 2025-05-07T20:30:14.4308322Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:14.4308986Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:14.4353281Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:14.4353764Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:14.4367045Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:14.4367401Z env: 2025-05-07T20:30:14.4367625Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:14.4367923Z BUILD_ENV: build_binary 2025-05-07T20:30:14.4368165Z BUILD_TARGET: genai 2025-05-07T20:30:14.4368391Z BUILD_VARIANT: cuda 2025-05-07T20:30:14.4368622Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:14.4368872Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:14.4369177Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:14.4369516Z ##[endgroup] 2025-05-07T20:30:14.7733260Z ################################################################################ 2025-05-07T20:30:14.7733644Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:14.7733903Z # 2025-05-07T20:30:14.7749367Z # [2025-05-07T20:30:14.774Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:14.7749794Z ################################################################################ 2025-05-07T20:30:14.7750160Z 2025-05-07T20:30:22.6019246Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:22.6020058Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:22.6020599Z [TEST] Determined the test directories: 2025-05-07T20:30:22.6021027Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:22.6021445Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:22.6021849Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:22.6022117Z 2025-05-07T20:30:22.6030687Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:22.6037576Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:22.6038037Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:22.6038335Z 2025-05-07T20:30:23.0259358Z 2025-05-07T20:30:23.0259820Z [TEST] Installing PyTest ... 
2025-05-07T20:30:23.0282470Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:24.1315285Z Channels: 2025-05-07T20:30:24.1315532Z - conda-forge 2025-05-07T20:30:24.1315764Z Platform: linux-64 2025-05-07T20:30:27.4373070Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:28.5846450Z Solving environment: \ | / done 2025-05-07T20:30:28.8138510Z 2025-05-07T20:30:28.8139000Z ## Package Plan ## 2025-05-07T20:30:28.8139363Z 2025-05-07T20:30:28.8139790Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:28.8140444Z 2025-05-07T20:30:28.8140627Z added / updated specs: 2025-05-07T20:30:28.8141146Z - expecttest 2025-05-07T20:30:28.8141560Z - pytest 2025-05-07T20:30:28.8141803Z 2025-05-07T20:30:28.8141813Z 2025-05-07T20:30:28.8142042Z The following packages will be downloaded: 2025-05-07T20:30:28.8142516Z 2025-05-07T20:30:28.8142740Z package | build 2025-05-07T20:30:28.8143392Z ---------------------------|----------------- 2025-05-07T20:30:28.8144164Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:28.8144825Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:28.8145310Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:28.8145767Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:28.8146216Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:28.8146664Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:28.8147095Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:28.8147961Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:28.8148371Z ------------------------------------------------------------ 2025-05-07T20:30:28.8148724Z Total: 428 KB 2025-05-07T20:30:28.8148939Z 2025-05-07T20:30:28.8149073Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:28.8149297Z 2025-05-07T20:30:28.8149505Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:28.8150153Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:28.8150700Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:28.8151186Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:28.8151673Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:28.8152141Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:28.8152591Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:28.8153021Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:28.8153295Z 2025-05-07T20:30:28.8153299Z 2025-05-07T20:30:28.8153303Z 2025-05-07T20:30:28.8153452Z Downloading and Extracting Packages: ...working... 
[conda download progress redraws and terminal-control residue elided; colorama-0.4.6, exceptiongroup-1.2.2, expecttest-0.3.0, iniconfig-2.0.0, packaging-25.0, pluggy-1.5.0, pytest-8.3.5, and tomli-2.2.1 all reached 100%] done
2025-05-07T20:30:29.3870860Z Preparing transaction: done
2025-05-07T20:30:29.4875729Z Verifying transaction: done
2025-05-07T20:30:31.3902237Z Executing transaction: done
2025-05-07T20:30:31.5216265Z [TEST] Checking imports ...
2025-05-07T20:30:35.4487811Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:35.4499735Z [TEST] Setting feature flags ...
2025-05-07T20:30:35.4500166Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:35.4500530Z 2025-05-07T20:30:35.8694660Z 2025-05-07T20:30:35.8695058Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:35.8696669Z ################################################################################ 2025-05-07T20:30:35.8697307Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:35.8697558Z # 2025-05-07T20:30:35.8716641Z # [2025-05-07T20:30:35.871Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:35.8717077Z ################################################################################ 2025-05-07T20:30:35.8717299Z 2025-05-07T20:30:35.8724427Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:35.8753237Z ./attention/gqa_test.py 2025-05-07T20:30:35.8753528Z ./coalesce/coalesce_test.py 2025-05-07T20:30:35.8753804Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:35.8754089Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:35.8754404Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:35.8754667Z ./moe/activation_test.py 2025-05-07T20:30:35.8754922Z ./moe/gather_scatter_test.py 2025-05-07T20:30:35.8755184Z ./moe/layers_test.py 2025-05-07T20:30:35.8755425Z ./moe/shuffling_test.py 2025-05-07T20:30:35.8755670Z ./quantize/quantize_test.py 2025-05-07T20:30:35.8755862Z 2025-05-07T20:30:35.8755980Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:35.8756207Z 2025-05-07T20:30:35.8773450Z ################################################################################ 2025-05-07T20:30:35.8789000Z # [2025-05-07T20:30:35.878Z] Run Python Test Suite: 2025-05-07T20:30:35.8789328Z # ./attention/gqa_test.py 2025-05-07T20:30:35.8789605Z ################################################################################ 2025-05-07T20:30:35.8812828Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:35.8813483Z 2025-05-07T20:30:38.4237855Z ============================= test session starts ============================== 2025-05-07T20:30:38.4238537Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:38.4239096Z cachedir: .pytest_cache 2025-05-07T20:30:38.4239729Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:38.4240774Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:38.4241212Z plugins: hypothesis-6.131.14 2025-05-07T20:30:39.9589494Z collecting ... 
collected 2 items

2025-05-07T20:31:16.4684646Z attention/gqa_test.py::Int4GQATest::test_gqa
Hypothesis (profile 'ci', derandomized) tried 40 examples; the repetitive "Trying example: test_gqa(...)" records, whose object reprs were mangled by the log capture, are condensed here as (int4_kv, num_groups, B, MAX_T, N_H_L) tuples in the order tried:
(False,1,1,4,1) (True,1,1,4,1) (True,4,23,33,68) (True,4,77,4,1) (True,4,77,52,67) (False,4,57,45,120) (True,4,52,42,53) (True,1,77,95,53)
(True,4,113,48,96) (False,1,51,61,69) (False,4,17,113,65) (False,4,17,65,65) (False,4,65,65,65) (False,1,6,108,14) (False,1,6,14,14) (False,1,6,6,14)
(False,1,6,6,6) (False,1,70,94,78) (False,1,78,94,78) (False,1,94,94,78) (False,1,94,94,94) (False,4,41,105,126) (False,4,105,105,126) (False,4,105,105,105)
(True,1,95,114,43) (True,1,43,114,43) (True,1,43,43,43) (False,1,21,38,42) (False,1,38,38,42) (False,1,38,42,42) (False,1,42,42,42) (True,1,74,20,15)
(True,1,20,20,15) (True,1,20,15,15) (True,1,15,20,15) (True,1,15,15,15) (False,4,117,104,69) (False,4,117,117,69) (False,4,69,117,69) (False,4,117,69,69)
2025-05-07T20:31:16.4774170Z PASSED
2025-05-07T20:31:16.5120976Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
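The session header above reports hypothesis profile 'ci' with derandomize=True, which is why the same example sequence is replayed on every run. The project's actual conftest.py is not shown in this log; a sketch of registering such a profile, with the parameters copied from the header, would look like:

# Hypothetical registration of the derandomized 'ci' profile seen in the
# pytest session header above.
conda run -n build_binary python - <<'PY'
from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,                 # no example database
    deadline=None,                 # no per-example time limit
    print_blob=True,               # print reproduction blobs on failure
    derandomize=True,              # deterministic example generation
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")
print(settings.default)
PY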
2025-05-07T20:31:16.5121322Z 2025-05-07T20:31:16.5121965Z =========================== short test summary info ============================ 2025-05-07T20:31:16.5122767Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:16.5123522Z ======================== 1 passed, 1 skipped in 38.60s ========================= 2025-05-07T20:31:17.1647784Z 2025-05-07T20:31:17.1648453Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:17.1669061Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:31:17.1669479Z 2025-05-07T20:31:17.1669486Z 2025-05-07T20:31:17.1669491Z 2025-05-07T20:31:17.1669496Z 2025-05-07T20:31:17.1691638Z ################################################################################ 2025-05-07T20:31:17.1707307Z # [2025-05-07T20:31:17.170Z] Run Python Test Suite: 2025-05-07T20:31:17.1707809Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:17.1708205Z ################################################################################ 2025-05-07T20:31:17.1733223Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:17.1734032Z 2025-05-07T20:31:19.3474705Z ============================= test session starts ============================== 2025-05-07T20:31:19.3475440Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:19.3476085Z cachedir: .pytest_cache 2025-05-07T20:31:19.3477354Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:19.3478895Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:19.3479740Z plugins: hypothesis-6.131.14 2025-05-07T20:31:20.9050470Z collecting ... 
collected 1 item 2025-05-07T20:31:20.9050909Z 2025-05-07T20:31:21.6415220Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:21.6415662Z 2025-05-07T20:31:21.6415842Z ============================== 1 passed in 2.43s =============================== 2025-05-07T20:31:22.2824802Z 2025-05-07T20:31:22.2825244Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:22.2846158Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:22.2846513Z 2025-05-07T20:31:22.2846518Z 2025-05-07T20:31:22.2846522Z 2025-05-07T20:31:22.2846526Z 2025-05-07T20:31:22.2866572Z ################################################################################ 2025-05-07T20:31:22.2881996Z # [2025-05-07T20:31:22.287Z] Run Python Test Suite: 2025-05-07T20:31:22.2882338Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:22.2882626Z ################################################################################ 2025-05-07T20:31:22.2907869Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:22.2908552Z 2025-05-07T20:31:24.4461893Z ============================= test session starts ============================== 2025-05-07T20:31:24.4462601Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:24.4463154Z cachedir: .pytest_cache 2025-05-07T20:31:24.4463780Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:24.4464560Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:24.4464989Z plugins: hypothesis-6.131.14 2025-05-07T20:31:26.0472050Z collecting ... 
collected 5 items 2025-05-07T20:31:26.0472295Z 2025-05-07T20:31:26.0483743Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:26.0493104Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:26.0501779Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:26.0510422Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:26.0528250Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:26.0528721Z 2025-05-07T20:31:26.0529229Z =========================== short test summary info ============================ 2025-05-07T20:31:26.0530256Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0531667Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0533094Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0534514Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0535932Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:26.0536905Z ============================== 5 skipped in 1.74s ============================== 2025-05-07T20:31:26.5917118Z 2025-05-07T20:31:26.5917603Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:26.5937774Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:26.5938082Z 2025-05-07T20:31:26.5938091Z 2025-05-07T20:31:26.5938095Z 2025-05-07T20:31:26.5938108Z 2025-05-07T20:31:26.5958301Z ################################################################################ 2025-05-07T20:31:26.5973602Z # [2025-05-07T20:31:26.597Z] Run Python Test Suite: 2025-05-07T20:31:26.5974027Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:26.5974461Z ################################################################################ 2025-05-07T20:31:26.6000238Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:26.6001181Z 2025-05-07T20:31:28.7542109Z ============================= test session starts ============================== 2025-05-07T20:31:28.7542825Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:28.7543389Z cachedir: .pytest_cache 2025-05-07T20:31:28.7544002Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:28.7544782Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:28.7545210Z plugins: hypothesis-6.131.14 2025-05-07T20:31:30.4353044Z collecting ... 
collected 2 items 2025-05-07T20:31:30.4353316Z 2025-05-07T20:31:30.4365132Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:30.4379953Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:30.4380439Z 2025-05-07T20:31:30.4380607Z =========================== short test summary info ============================ 2025-05-07T20:31:30.4381273Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:30.4382164Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:30.4383059Z ============================== 2 skipped in 1.82s ============================== 2025-05-07T20:31:30.9999584Z 2025-05-07T20:31:30.9999962Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:31.0021099Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:31.0021452Z 2025-05-07T20:31:31.0021456Z 2025-05-07T20:31:31.0021461Z 2025-05-07T20:31:31.0021485Z 2025-05-07T20:31:31.0043119Z ################################################################################ 2025-05-07T20:31:31.0058684Z # [2025-05-07T20:31:31.005Z] Run Python Test Suite: 2025-05-07T20:31:31.0060100Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:31.0060405Z ################################################################################ 2025-05-07T20:31:31.0083554Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:31.0084217Z 2025-05-07T20:31:33.1570462Z ============================= test session starts ============================== 2025-05-07T20:31:33.1571289Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:33.1571856Z cachedir: .pytest_cache 2025-05-07T20:31:33.1572470Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:33.1573278Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:33.1573712Z plugins: hypothesis-6.131.14 2025-05-07T20:31:34.7358208Z collecting ... collected 4 items 2025-05-07T20:31:34.7358545Z 2025-05-07T20:31:37.7740394Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:37.7903226Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:37.8096719Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:37.8257335Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:37.8257892Z 2025-05-07T20:31:37.8258076Z =========================== short test summary info ============================ 2025-05-07T20:31:37.8258838Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:37.8260220Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when xformers is not available 2025-05-07T20:31:37.8260871Z ============================== 4 skipped in 4.80s ============================== 2025-05-07T20:31:39.5401834Z 2025-05-07T20:31:39.5402311Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:39.5423088Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:39.5423397Z 2025-05-07T20:31:39.5423402Z 2025-05-07T20:31:39.5423406Z 2025-05-07T20:31:39.5423409Z 2025-05-07T20:31:39.5442468Z ################################################################################ 2025-05-07T20:31:39.5457579Z # [2025-05-07T20:31:39.545Z] Run Python Test Suite: 2025-05-07T20:31:39.5457913Z # ./moe/activation_test.py 2025-05-07T20:31:39.5458198Z ################################################################################ 2025-05-07T20:31:39.5483886Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:39.5484539Z 2025-05-07T20:31:41.7055721Z ============================= test session starts ============================== 2025-05-07T20:31:41.7056410Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:41.7056962Z cachedir: .pytest_cache 2025-05-07T20:31:41.7057588Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:41.7058369Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:41.7058797Z plugins: hypothesis-6.131.14 2025-05-07T20:31:43.3624798Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:43.5767842Z collecting ... 
collected 2 items

2025-05-07T20:31:49.5708350Z moe/activation_test.py::ActivationTests::test_silu_mul
Hypothesis (profile 'ci', derandomized) tried 40 examples; the repetitive "Trying example: test_silu_mul(...)" records, whose object reprs were mangled by the log capture, are condensed here as (T, D, contiguous, compiled) tuples in the order tried:
(1,5120,True,True) (4096,5120,True,True) (4096,7168,False,False) (4096,5120,False,True) (1,7168,True,True) (1,7168,False,True) (4096,5120,False,False) (1,7168,True,False)
(2048,5120,True,True) (2048,7168,True,True) (2048,7168,True,False) (128,5120,False,True) (128,5120,True,True) (16384,5120,False,True) (16384,5120,False,False) (128,7168,True,False)
(128,7168,False,False) (1,5120,False,False) (1,7168,False,False) (4096,5120,True,False) (128,7168,True,True) (1,5120,False,True) (4096,7168,True,False) (4096,7168,False,True)
(128,5120,True,False) (128,5120,False,False) (1,5120,True,False) (2048,7168,False,True) (2048,7168,False,False) (16384,7168,False,True) (16384,7168,True,True) (4096,7168,True,True)
(2048,5120,False,False) (2048,5120,True,False) (128,7168,False,True) (16384,5120,True,True) (2048,5120,False,True) (16384,5120,True,False) (16384,7168,False,False) (16384,7168,True,False)
2025-05-07T20:31:49.5795265Z PASSED
2025-05-07T20:31:49.6407422Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:49.6408616Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
2025-05-07T20:31:49.6410142Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:49.6411752Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:49.6413276Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:49.6414817Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.6416649Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:49.6418178Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.6419741Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:49.6421117Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
2025-05-07T20:31:49.6422455Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.6423808Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:49.6424946Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:49.6426067Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:49.6427406Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.6428825Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.6430448Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:49.6431596Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:49.6432891Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.6434388Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.6435599Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.6436602Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.6437403Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:49.6438514Z W0507 20:31:49.639295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
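The warning block above comes from torch.compile's Triton integration: when identify_mutated_tensors cannot lower a user-defined Triton kernel to TTIR, it conservatively assumes every input is mutated and carries on. The underlying failure is the ValueError itself: Triton's fp8e4nv type (PyTorch's float8_e4m3fn) is only generated for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); older parts only expose fp8e4b15 and fp8e5, so every fp8 kernel compilation in this job fails the same way, and this warning repeats for each compilation attempt. A minimal sketch of a capability guard that could skip these cases instead of erroring inside the compiler; the supports_fp8 helper is illustrative, not an FBGEMM or Triton API:

import unittest

import torch

def supports_fp8() -> bool:
    # Triton's fp8e4nv (float8_e4m3fn) codegen requires NVIDIA compute
    # capability >= 8.9 (Ada/Hopper); earlier GPUs raise the ValueError
    # seen in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
class Fp8KernelTests(unittest.TestCase):
    ...

With a guard like this, the hypothesis sweep below would report the fp8 tests as skipped on pre-sm_89 runners rather than failing during Triton compilation.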
2025-05-07T20:31:50.2147370Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:50.2148107Z self=, 2025-05-07T20:31:50.2148553Z T=1, 2025-05-07T20:31:50.2148756Z D=5120, 2025-05-07T20:31:50.2148961Z scale_ub=None, 2025-05-07T20:31:50.2149183Z contiguous=True, 2025-05-07T20:31:50.2149413Z compiled=True, 2025-05-07T20:31:50.2149636Z ) 2025-05-07T20:31:50.2150127Z self = 2025-05-07T20:31:50.2150686Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:50.2150976Z 2025-05-07T20:31:50.2151057Z @given( 2025-05-07T20:31:50.2151309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:50.2151640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:50.2151961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:50.2152311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:50.2152652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:50.2152950Z ) 2025-05-07T20:31:50.2153346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:50.2153830Z def test_silu_mul_quant( 2025-05-07T20:31:50.2154078Z self, 2025-05-07T20:31:50.2154279Z T: int, 2025-05-07T20:31:50.2154482Z D: int, 2025-05-07T20:31:50.2154698Z scale_ub: Optional[float], 2025-05-07T20:31:50.2154984Z contiguous: bool, 2025-05-07T20:31:50.2155237Z compiled: bool, 2025-05-07T20:31:50.2155472Z ) -> None: 2025-05-07T20:31:50.2155681Z torch.manual_seed(2025) 2025-05-07T20:31:50.2155936Z 2025-05-07T20:31:50.2157348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:50.2157717Z 2025-05-07T20:31:50.2157912Z x_sign = torch.sign(x) 2025-05-07T20:31:50.2158215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:50.2158535Z x = x_sign * x_clamp 2025-05-07T20:31:50.2158786Z x0 = x[:, :D] 2025-05-07T20:31:50.2159011Z x1 = x[:, D:] 2025-05-07T20:31:50.2159219Z 2025-05-07T20:31:50.2159409Z if contiguous: 2025-05-07T20:31:50.2159651Z x0 = x0.contiguous()
2025-05-07T20:31:50.2159910Z x1 = x1.contiguous() 2025-05-07T20:31:50.2160159Z 2025-05-07T20:31:50.2160353Z if scale_ub is not None: 2025-05-07T20:31:50.2160629Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:50.2160986Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:50.2161309Z ) 2025-05-07T20:31:50.2161505Z else: 2025-05-07T20:31:50.2161711Z scale_ub_tensor = None 2025-05-07T20:31:50.2161977Z 2025-05-07T20:31:50.2162224Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.2162556Z op = silu_mul_quant 2025-05-07T20:31:50.2162807Z if compiled: 2025-05-07T20:31:50.2163062Z op = torch.compile(op) 2025-05-07T20:31:50.2163376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.2163661Z 2025-05-07T20:31:50.2163863Z y_fp8, y_scale = fn() 2025-05-07T20:31:50.2164159Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:50.2164456Z 2025-05-07T20:31:50.2164695Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.2165047Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:50.2173687Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:50.2174240Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:50.2174617Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.2174947Z 2025-05-07T20:31:50.2175159Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:50.2175368Z 2025-05-07T20:31:50.2175477Z moe/activation_test.py:126: 2025-05-07T20:31:50.2175781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.2176135Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:50.2176477Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.2177330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:50.2178157Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:50.2178742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:50.2179649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:50.2180396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:50.2181177Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:50.2181996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:50.2183226Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:50.2184078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:50.2184771Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:50.2185413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:50.2185975Z fn() 2025-05-07T20:31:50.2186529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:50.2187323Z self.fn.run( 2025-05-07T20:31:50.2187829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:50.2188390Z kernel = self.compile( 2025-05-07T20:31:50.2188963Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:50.2189878Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.2190425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.2190755Z 2025-05-07T20:31:50.2191035Z self = 2025-05-07T20:31:50.2192456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:50.2193998Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f214764c0>} 2025-05-07T20:31:50.2195487Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:50.2196613Z context = 2025-05-07T20:31:50.2196920Z 2025-05-07T20:31:50.2197098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:50.2197645Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.2198141Z module_map=module_map) 2025-05-07T20:31:50.2198676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.2199038Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:50.2199314Z E ^ 2025-05-07T20:31:50.2199814Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.2200304Z 2025-05-07T20:31:50.2200760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:50.2201320Z 2025-05-07T20:31:50.2201423Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:50.2201858Z self=, 2025-05-07T20:31:50.2202281Z T=2048, 2025-05-07T20:31:50.2202470Z D=5120, 2025-05-07T20:31:50.2202673Z scale_ub=1200.0, 2025-05-07T20:31:50.2202906Z contiguous=True, 2025-05-07T20:31:50.2203139Z compiled=False, 2025-05-07T20:31:50.2203351Z ) 2025-05-07T20:31:50.8081104Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:50.8082327Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:50.8084105Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:50.8085699Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:50.8087226Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:50.8089095Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.8090539Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:50.8092054Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.8093618Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:50.8094992Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:50.8096338Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.8097664Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:50.8098797Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:50.8099908Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:50.8101412Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.8102824Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.8104046Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:50.8105184Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:50.8106469Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.8107974Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.8109130Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.8110318Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.8111117Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:50.8112226Z W0507 20:31:50.803711 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
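Note that both sides of the comparison die identically: the kernel under test (_fbgemm_silu_mul_quant, via fn) and the reference path (triton_quantize_fp8_row, via ref_fn) each fail while compiling an fp8 kernel, so the assertion is never reached and the failure is environmental rather than a numerical bug. For orientation, the computation the test checks can be written in plain PyTorch. A rough sketch, assuming e4m3 fp8 with per-row scales and dequantization as y_fp8.to(torch.float32) * scale[:, None], which is how the test consumes the outputs; this mirrors the intent of triton_quantize_fp8_row but is not the FBGEMM implementation:

from typing import Optional, Tuple

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, then row-wise quantization to fp8 (e4m3).
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        # Cap the per-row maximum, as the scale_ub argument does in the
        # Triton kernels.
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = row_max.clamp(min=1e-12) / fp8_max      # avoid divide-by-zero
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale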
2025-05-07T20:31:52.3742900Z self = 2025-05-07T20:31:52.3744289Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:52.3744797Z 2025-05-07T20:31:52.3744884Z @given( 2025-05-07T20:31:52.3745140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:52.3745467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:52.3745793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:52.3746125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:52.3746463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:52.3746760Z ) 2025-05-07T20:31:52.3747126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:52.3747609Z def test_silu_mul_quant( 2025-05-07T20:31:52.3747851Z self, 2025-05-07T20:31:52.3748047Z T: int, 2025-05-07T20:31:52.3748248Z D: int, 2025-05-07T20:31:52.3748460Z scale_ub: Optional[float], 2025-05-07T20:31:52.3748737Z contiguous: bool, 2025-05-07T20:31:52.3748977Z compiled: bool, 2025-05-07T20:31:52.3749209Z ) -> None: 2025-05-07T20:31:52.3749429Z torch.manual_seed(2025) 2025-05-07T20:31:52.3749779Z 2025-05-07T20:31:52.3750055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:52.3750418Z 2025-05-07T20:31:52.3750611Z x_sign = torch.sign(x) 2025-05-07T20:31:52.3751247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:52.3751567Z x = x_sign * x_clamp 2025-05-07T20:31:52.3751814Z x0 = x[:, :D] 2025-05-07T20:31:52.3752036Z x1 = x[:, D:] 2025-05-07T20:31:52.3752243Z 2025-05-07T20:31:52.3752428Z if contiguous: 2025-05-07T20:31:52.3752658Z x0 = x0.contiguous() 2025-05-07T20:31:52.3752914Z x1 = x1.contiguous() 2025-05-07T20:31:52.3753154Z 2025-05-07T20:31:52.3753342Z if scale_ub is not None: 2025-05-07T20:31:52.3753613Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:52.3753957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:52.3754274Z ) 2025-05-07T20:31:52.3754460Z else: 2025-05-07T20:31:52.3754674Z scale_ub_tensor = None
2025-05-07T20:31:52.3754930Z 2025-05-07T20:31:52.3755156Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:52.3755488Z op = silu_mul_quant 2025-05-07T20:31:52.3755742Z if compiled: 2025-05-07T20:31:52.3755987Z op = torch.compile(op) 2025-05-07T20:31:52.3756292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:52.3756580Z 2025-05-07T20:31:52.3756779Z > y_fp8, y_scale = fn() 2025-05-07T20:31:52.3756945Z 2025-05-07T20:31:52.3757047Z moe/activation_test.py:117: 2025-05-07T20:31:52.3757347Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:52.3757695Z moe/activation_test.py:115: in fn 2025-05-07T20:31:52.3757978Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:52.3758726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:52.3759827Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:52.3760402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:52.3761142Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:52.3761858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:52.3762430Z kernel = self.compile( 2025-05-07T20:31:52.3763000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:52.3763711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:52.3764128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:52.3764386Z 2025-05-07T20:31:52.3764643Z self = 2025-05-07T20:31:52.3765828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:52.3767433Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f214d3ca0>} 2025-05-07T20:31:52.3768910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:52.3770045Z context = 2025-05-07T20:31:52.3770422Z 2025-05-07T20:31:52.3770664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:52.3771218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:52.3771719Z module_map=module_map) 2025-05-07T20:31:52.3772096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:52.3772565Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:52.3772829Z E ^ 2025-05-07T20:31:52.3773329Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:52.3773821Z 2025-05-07T20:31:52.3774280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:52.3774838Z 2025-05-07T20:31:52.3774950Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:52.3775376Z self=, 2025-05-07T20:31:52.3775797Z T=2048, 2025-05-07T20:31:52.3775987Z D=5120, 2025-05-07T20:31:52.3776179Z scale_ub=1200.0, 2025-05-07T20:31:52.3776402Z contiguous=True, 2025-05-07T20:31:52.3776631Z compiled=True, 2025-05-07T20:31:52.3776834Z ) 2025-05-07T20:31:52.3777161Z self = 2025-05-07T20:31:52.3777688Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:52.3777978Z 2025-05-07T20:31:52.3778057Z @given( 2025-05-07T20:31:52.3778294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:52.3778622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:52.3778945Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:52.3779290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:52.3779637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:52.3779942Z ) 2025-05-07T20:31:52.3780305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:52.3780774Z def test_silu_mul_quant( 2025-05-07T20:31:52.3781022Z self, 2025-05-07T20:31:52.3781218Z T: int, 2025-05-07T20:31:52.3781506Z D: int, 2025-05-07T20:31:52.3781734Z scale_ub: Optional[float], 2025-05-07T20:31:52.3782004Z contiguous: bool, 2025-05-07T20:31:52.3782254Z compiled: bool, 2025-05-07T20:31:52.3782495Z ) -> None: 2025-05-07T20:31:52.3782714Z torch.manual_seed(2025) 2025-05-07T20:31:52.3783323Z 2025-05-07T20:31:52.3783612Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:52.3783975Z 2025-05-07T20:31:52.3784169Z x_sign = torch.sign(x) 2025-05-07T20:31:52.3784470Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:52.3784825Z x = x_sign * x_clamp 2025-05-07T20:31:52.3785086Z x0 = x[:, :D] 2025-05-07T20:31:52.3785307Z x1 = x[:, D:] 2025-05-07T20:31:52.3785520Z 2025-05-07T20:31:52.3785703Z if contiguous: 2025-05-07T20:31:52.3785948Z x0 = x0.contiguous() 2025-05-07T20:31:52.3786221Z x1 = x1.contiguous() 2025-05-07T20:31:52.3786474Z 2025-05-07T20:31:52.3786675Z if scale_ub is not None: 2025-05-07T20:31:52.3786966Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:52.3787318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:52.3787643Z ) 2025-05-07T20:31:52.3787841Z else: 2025-05-07T20:31:52.3788055Z scale_ub_tensor = None 2025-05-07T20:31:52.3788320Z 2025-05-07T20:31:52.3788563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:52.3788891Z op = silu_mul_quant 2025-05-07T20:31:52.3789159Z if compiled: 2025-05-07T20:31:52.3789421Z op = torch.compile(op) 2025-05-07T20:31:52.3789837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:52.3790118Z 2025-05-07T20:31:52.3790309Z y_fp8, y_scale = fn() 2025-05-07T20:31:52.3790603Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:52.3790896Z 2025-05-07T20:31:52.3791134Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:52.3791484Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:52.3791782Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:52.3792298Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:52.3792683Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:52.3793010Z 2025-05-07T20:31:52.3793216Z > 
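Every failure in this job has the same root cause: Triton's fp8e4nv type (torch.float8_e4m3fn) is only accepted by the NVIDIA backend on compute capability 8.9 or newer, and the A10G on this g5 runner is SM 8.6, which is why the backend offers only 'fp8e4b15' and 'fp8e5'. A minimal sketch of a capability guard that would skip these tests on older parts; the helper name and the unittest wiring are illustrative assumptions, not FBGEMM's actual test scaffolding:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # Assumption: Triton maps fp8e4nv to torch.float8_e4m3fn, which its
    # NVIDIA backend only accepts on SM 8.9+ (Ada/Hopper). The A10G on
    # this runner reports (8, 6), so this returns False there.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantTests(unittest.TestCase):  # hypothetical test class
    ...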
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

[test source as in the listing above, continuing:]

    y_fp8, y_scale = fn()
    y = y_fp8.to(torch.float32) * y_scale[:, None]

    def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
        return triton_quantize_fp8_row(y, scale_ub_tensor)

>   y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f21ae3af0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
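Each "Trying example" line records the concrete Hypothesis draw, so a failing combination can be replayed deterministically with @example, which always runs before any random generation. A sketch using the strategies from this test; the test body here is a placeholder, not the real assertion:

from hypothesis import example, given, settings, strategies as st


@settings(max_examples=10, deadline=None)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
def test_replayed_example(T, D, scale_ub, contiguous, compiled) -> None:
    # The @example draw above is one of the failing combinations from
    # this log; it is replayed first on every invocation.
    assert T in (1, 128, 2048, 4096, 16384)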
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last):
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     generator.visit(fn.parse())
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ret = super().visit(node)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return visitor(node)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     self.visit(item)
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant(
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^
W0507 20:31:52.807044 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[the identical warning and traceback were logged three more times, at 20:31:52.970502, 20:31:53.466465, and 20:31:53.505916]
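The repeated identify_mutated_tensors warnings above are a side effect of the same failure: when torch.compile cannot lower the user-defined Triton kernel to TTIR, it logs the exception and conservatively assumes every input is mutated. The dtype names in the message map to torch types: fp8e4nv is torch.float8_e4m3fn (SM 8.9+ only), while fp8e5 is torch.float8_e5m2, which the message says this GPU still accepts. A hedged fallback sketch along those lines; pick_fp8_dtype is illustrative, not an fbgemm_gpu API:

import torch


def pick_fp8_dtype() -> torch.dtype:
    # Assumption: fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+, while
    # fp8e5 (torch.float8_e5m2) is accepted on this SM 8.6 A10G, per the
    # error message in the log above.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2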
self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

[test source identical to the first listing above]

>   y_fp8, y_scale = fn()

moe/activation_test.py:117:
[traceback frames identical to the first fn() failure above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

self =
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

[test source identical to the second listing above]

>   y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[traceback frames identical to the first ref_fn() failure above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
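The ref_fn path fails one level deeper, inside triton_quantize_fp8_row, while autotuning _kernel_quantize_fp8_row. As a rough mental model of what that row-wise quantization computes, an assumption inferred from how the test dequantizes (y_fp8.to(float32) * y_scale[:, None]) rather than the kernel's actual code:

from typing import Optional, Tuple

import torch


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor]
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Sketch only: scale each row so its max magnitude maps to the
    # float8_e4m3fn max (448.0), with scale_ub, if given, capping the
    # row max. Not the fbgemm_gpu kernel's actual implementation.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=1).to(torch.float32).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    y_scale = row_max / fp8_max                # per-row dequant scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale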
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[the same identify_mutated_tensors warning and Triton traceback as above were logged four times with retry counter [1/3], at 20:31:55.615908, 20:31:56.233270, 20:31:56.993694, and 20:31:57.033100]
2025-05-07T20:32:00.5808069Z self = 
2025-05-07T20:32:00.5809033Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:00.5809501Z 
2025-05-07T20:32:00.5809626Z @given(
2025-05-07T20:32:00.5809995Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.5810510Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.5811012Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.5811559Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.5812107Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.5812581Z )
2025-05-07T20:32:00.5813102Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.5813754Z def test_silu_mul_quant(
2025-05-07T20:32:00.5814113Z     self,
2025-05-07T20:32:00.5814391Z     T: int,
2025-05-07T20:32:00.5814666Z     D: int,
2025-05-07T20:32:00.5814973Z     scale_ub: Optional[float],
2025-05-07T20:32:00.5815351Z     contiguous: bool,
2025-05-07T20:32:00.5816132Z     compiled: bool,
2025-05-07T20:32:00.5816517Z ) -> None:
2025-05-07T20:32:00.5816815Z     torch.manual_seed(2025)
2025-05-07T20:32:00.5817159Z 
2025-05-07T20:32:00.5817547Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.5818039Z 
2025-05-07T20:32:00.5818311Z     x_sign = torch.sign(x)
2025-05-07T20:32:00.5818729Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.5819176Z     x = x_sign * x_clamp
2025-05-07T20:32:00.5819522Z     x0 = x[:, :D]
2025-05-07T20:32:00.5819824Z     x1 = x[:, D:]
2025-05-07T20:32:00.5820119Z 
2025-05-07T20:32:00.5820380Z     if contiguous:
2025-05-07T20:32:00.5820698Z         x0 = x0.contiguous()
2025-05-07T20:32:00.5821071Z         x1 = x1.contiguous()
2025-05-07T20:32:00.5821417Z 
2025-05-07T20:32:00.5821683Z     if scale_ub is not None:
2025-05-07T20:32:00.5822072Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.5822640Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.5823171Z         )
2025-05-07T20:32:00.5823492Z     else:
2025-05-07T20:32:00.5823840Z         scale_ub_tensor = None
2025-05-07T20:32:00.5824261Z 
2025-05-07T20:32:00.5824650Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.5825199Z         op = silu_mul_quant
2025-05-07T20:32:00.5825612Z         if compiled:
2025-05-07T20:32:00.5826030Z             op = torch.compile(op)
2025-05-07T20:32:00.5826588Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.5827059Z 
2025-05-07T20:32:00.5827378Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:00.5827673Z 
2025-05-07T20:32:00.5827837Z moe/activation_test.py:117: 
2025-05-07T20:32:00.5828348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:00.5829161Z moe/activation_test.py:115: in fn
2025-05-07T20:32:00.5829649Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.5831255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:00.5832730Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:00.5833857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:00.5835227Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:00.5836316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:00.5837114Z     kernel = self.compile(
2025-05-07T20:32:00.5837948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:00.5839009Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:00.5839678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:00.5840069Z 
2025-05-07T20:32:00.5840401Z self = 
2025-05-07T20:32:00.5842234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:00.5844635Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f21ae0700>}
2025-05-07T20:32:00.5846993Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:00.5848786Z context = 
2025-05-07T20:32:00.5849281Z 
2025-05-07T20:32:00.5849722Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:00.5850634Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:00.5851447Z                            module_map=module_map)
2025-05-07T20:32:00.5852047Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.5852641Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:00.5853074Z E       ^
2025-05-07T20:32:00.5853861Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
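Note: every failure in this job has the same root cause, visible in the traceback above: Triton refuses to emit the fp8e4nv dtype (torch.float8_e4m3fn) while compiling _fbgemm_silu_mul_quant. The runner is a linux.g5.4xlarge, whose NVIDIA A10G reports compute capability sm_86, and Triton only enables fp8e4nv on sm_89+ (Ada/Hopper); pre-sm_89 parts are limited to fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability gate that would skip these cases on unsupported GPUs — the helper and the skipif usage are illustrative, not part of moe/activation_test.py:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to torch.float8_e4m3fn and is only compiled
        # for NVIDIA GPUs with compute capability >= (8, 9) (Ada / Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @pytest.mark.skipif(not supports_fp8e4nv(), reason="fp8e4nv requires sm_89+")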
2025-05-07T20:32:00.5854666Z 
2025-05-07T20:32:00.5855393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:00.5856294Z 
2025-05-07T20:32:00.5856469Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.5857158Z     self=,
2025-05-07T20:32:00.5857824Z     T=4096,
2025-05-07T20:32:00.5858136Z     D=7168,
2025-05-07T20:32:00.5858440Z     scale_ub=None,
2025-05-07T20:32:00.5858778Z     contiguous=False,
2025-05-07T20:32:00.5859146Z     compiled=False,
2025-05-07T20:32:00.5859480Z )
[... test source and _fbgemm_silu_mul_quant CompilationError identical to the previous example; omitted ...]
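Note: the full test body is echoed once per drawn example because the test runs with @settings(verbosity=Verbosity.verbose, ...); Hypothesis then prints "Trying example:" with all arguments for every draw, so a single environment bug repeats until max_examples is exhausted. A sketch of the same strategy style at default verbosity, which reports only the final shrunk counterexample (the test below is a stand-in, not the FBGEMM test):

    from hypothesis import Verbosity, given, settings, strategies as st

    @settings(verbosity=Verbosity.normal, max_examples=16, deadline=None)
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_shapes_quietly(T: int) -> None:
        assert T >= 1  # placeholder assertion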
2025-05-07T20:32:00.5908230Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.5908913Z     self=,
2025-05-07T20:32:00.5909551Z     T=128,
2025-05-07T20:32:00.5909922Z     D=7168,
2025-05-07T20:32:00.5910221Z     scale_ub=None,
2025-05-07T20:32:00.5910558Z     contiguous=False,
2025-05-07T20:32:00.5910902Z     compiled=True,
2025-05-07T20:32:00.5911218Z )
2025-05-07T20:32:00.6710034Z self = 
2025-05-07T20:32:00.6710941Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
[... @given decorators and test body identical to above, through scale_ub_tensor = None; omitted ...]
2025-05-07T20:32:00.6726853Z 
2025-05-07T20:32:00.6727227Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.6727720Z         op = silu_mul_quant
2025-05-07T20:32:00.6728092Z         if compiled:
2025-05-07T20:32:00.6728468Z             op = torch.compile(op)
2025-05-07T20:32:00.6728921Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.6729351Z 
2025-05-07T20:32:00.6729641Z     y_fp8, y_scale = fn()
2025-05-07T20:32:00.6730100Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:00.6730566Z 
2025-05-07T20:32:00.6730938Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.6731484Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:00.6731976Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:00.6732497Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:00.6733104Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.6733645Z 
2025-05-07T20:32:00.6733977Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:00.6734302Z 
2025-05-07T20:32:00.6734476Z moe/activation_test.py:126: 
2025-05-07T20:32:00.6734942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:00.6735490Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:00.6735995Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.6737267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:00.6738601Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:00.6739571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:00.6740781Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:00.6742023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:00.6743466Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:00.6744828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:00.6746184Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:00.6747495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:00.6748505Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:00.6749401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:00.6750386Z     fn()
2025-05-07T20:32:00.6751232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:00.6752264Z     self.fn.run(
2025-05-07T20:32:00.6753026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:00.6753955Z     kernel = self.compile(
2025-05-07T20:32:00.6754885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:00.6755999Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:00.6756669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:00.6757075Z 
2025-05-07T20:32:00.6757413Z self = 
2025-05-07T20:32:00.6759309Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:00.6761970Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f1d2114c0>}
2025-05-07T20:32:00.6764364Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:00.6766163Z context = 
2025-05-07T20:32:00.6766671Z 
2025-05-07T20:32:00.6766948Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:00.6767847Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:00.6768653Z                            module_map=module_map)
2025-05-07T20:32:00.6769253Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.6769841Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:00.6770289Z E       ^
2025-05-07T20:32:00.6771078Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:00.6771898Z 
2025-05-07T20:32:00.6772636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
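Note: the reference path fails identically because triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) that writes an fp8e4nv output, so on this GPU even the "reference" half of the comparison cannot run. A sketch of an eager, Triton-free row-wise fp8 quantization that could stand in on such hardware — the scale convention (dequantize via y_fp8.to(torch.float32) * scale[:, None]) follows the test above, but matching triton_quantize_fp8_row's exact semantics is an assumption:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs scaling into the float8_e4m3fn range (max normal = 448).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max  # per-row dequantization multiplier
        y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)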
2025-05-07T20:32:00.6773706Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.6774411Z     self=,
2025-05-07T20:32:00.6775086Z     T=128,
2025-05-07T20:32:00.6775382Z     D=7168,
2025-05-07T20:32:00.6775680Z     scale_ub=None,
2025-05-07T20:32:00.6776017Z     contiguous=False,
2025-05-07T20:32:00.6776399Z     compiled=False,
2025-05-07T20:32:00.6776747Z )
[... test source and _fbgemm_silu_mul_quant CompilationError identical to the first example; omitted ...]
2025-05-07T20:32:00.9322405Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.9322836Z     self=,
2025-05-07T20:32:00.9323262Z     T=4096,
2025-05-07T20:32:00.9323443Z     D=5120,
2025-05-07T20:32:00.9323730Z     scale_ub=1200.0,
2025-05-07T20:32:00.9323955Z     contiguous=True,
2025-05-07T20:32:00.9324175Z     compiled=False,
2025-05-07T20:32:00.9324387Z )
[... test source and _fbgemm_silu_mul_quant CompilationError identical to the first example; omitted ...]
2025-05-07T20:32:00.9364099Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.9364613Z     self=,
2025-05-07T20:32:00.9365040Z     T=1,
2025-05-07T20:32:00.9365215Z     D=5120,
2025-05-07T20:32:00.9365407Z     scale_ub=None,
2025-05-07T20:32:00.9365623Z     contiguous=True,
2025-05-07T20:32:00.9365847Z     compiled=True,
2025-05-07T20:32:00.9366051Z )
[... identify_mutated_tensors warning traceback ([1/4]) repeated four times (timestamps 20:32:01.466499, 20:32:01.654495, 20:32:02.161635, 20:32:02.201485), each ending in the same fp8e4nv ValueError; omitted ...]
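Note: the [1/3], [1/4], [1/5] prefixes on these warnings are torch.compile restart counters. On each attempt, Dynamo lowers the user-defined Triton kernel to TTIR (generate_ttir) to determine which tensor arguments the kernel mutates; when that lowering raises — here, the fp8e4nv ValueError — identify_mutated_tensors falls back to conservatively treating every input as mutated, logs this warning, and tracing continues. A minimal kernel that reproduces the underlying Triton error on a pre-sm_89 GPU (illustrative code, not from FBGEMM):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fp8_store_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The fp8e4nv-typed store (FBGEMM's kernels likewise write fp8 outputs)
        # is what trips "type fp8e4nv not supported in this architecture".
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises triton.compiler.errors.CompilationError on sm_86 (e.g. A10G):
    fp8_store_kernel[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)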
2025-05-07T20:32:02.5528823Z self = 
2025-05-07T20:32:02.5529441Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to above; fails the same way as the T=128, D=7168, compiled=True example: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row CompilationError with the same fp8e4nv ValueError ...]
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.5561953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ee021f700>} 2025-05-07T20:32:02.5563419Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.5564527Z context = 2025-05-07T20:32:02.5564837Z 2025-05-07T20:32:02.5565008Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.5565558Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.5566051Z module_map=module_map) 2025-05-07T20:32:02.5566427Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.5566785Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:02.5567060Z E ^ 2025-05-07T20:32:02.5567550Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.5568037Z 2025-05-07T20:32:02.5568494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.5569048Z 2025-05-07T20:32:02.5569148Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.5569579Z self=, 2025-05-07T20:32:02.5569999Z T=2048, 2025-05-07T20:32:02.5570182Z D=5120, 2025-05-07T20:32:02.5570380Z scale_ub=None, 2025-05-07T20:32:02.5570676Z contiguous=True, 2025-05-07T20:32:02.5570895Z compiled=True, 2025-05-07T20:32:02.5571101Z ) 2025-05-07T20:32:03.0501592Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.0502770Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:03.0504255Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.0505874Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.0507410Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.0508942Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.0510532Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.0512057Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.0513945Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.0515324Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.0516670Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.0518053Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:03.0519192Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:03.0520326Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:03.0521662Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.0523083Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.0524304Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:03.0525451Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:03.0526894Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.0528391Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.0529553Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.0530542Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.0531350Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:03.0532465Z W0507 20:32:03.046074 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.2376732Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.2378290Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:03.2379779Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.2381366Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.2383453Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.2384984Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.2386428Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.2387948Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.2389522Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.2390996Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.2392338Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.2393663Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:03.2394797Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:03.2396063Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:03.2397405Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.2398810Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.2400031Z W0507 
20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:03.2401174Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:03.2402476Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.2403972Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.2405129Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.2406122Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.2406930Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:03.2408161Z W0507 20:32:03.233623 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.7439221Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.7440407Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:03.7441898Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.7443497Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.7445041Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.7446572Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.7448022Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.7449553Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.7451484Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.7452873Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.7454219Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.7455551Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:03.7456690Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:03.7457816Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:03.7459214Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.7460618Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.7461838Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:03.7462982Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:03.7464432Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.7465929Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.7467088Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.7468078Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.7468879Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:03.7470164Z W0507 20:32:03.739933 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.7831384Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.7832524Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:03.7833985Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.7835542Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.7837208Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.7838726Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.7840150Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.7841646Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.7843201Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:03.7844578Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:03.7845937Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:03.7847260Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:03.7848385Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:03.7849627Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:03.7850970Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:03.7852377Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:03.7853587Z W0507 
20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:03.7854727Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:03.7856019Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:03.7857520Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:03.7858680Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.7859664Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.7860466Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:03.7861586Z W0507 20:32:03.779295 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.2758739Z self = 2025-05-07T20:32:04.2759809Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.2760209Z 2025-05-07T20:32:04.2760320Z @given( 2025-05-07T20:32:04.2760625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.2761030Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.2761383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.2761727Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.2762066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.2762558Z ) 2025-05-07T20:32:04.2762935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.2763402Z def test_silu_mul_quant( 2025-05-07T20:32:04.2763652Z self, 2025-05-07T20:32:04.2763857Z T: int, 2025-05-07T20:32:04.2764051Z D: int, 2025-05-07T20:32:04.2764272Z scale_ub: Optional[float], 2025-05-07T20:32:04.2764554Z contiguous: bool, 2025-05-07T20:32:04.2764796Z compiled: bool, 2025-05-07T20:32:04.2765038Z ) -> None: 2025-05-07T20:32:04.2765258Z torch.manual_seed(2025) 2025-05-07T20:32:04.2765501Z 2025-05-07T20:32:04.2765781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.2766143Z 2025-05-07T20:32:04.2766344Z x_sign = torch.sign(x) 2025-05-07T20:32:04.2766647Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.2766964Z x = x_sign * x_clamp 2025-05-07T20:32:04.2767210Z x0 = x[:, :D] 2025-05-07T20:32:04.2767435Z x1 = x[:, D:] 2025-05-07T20:32:04.2767644Z 2025-05-07T20:32:04.2767835Z if contiguous: 2025-05-07T20:32:04.2768071Z x0 = x0.contiguous() 2025-05-07T20:32:04.2768327Z x1 = x1.contiguous() 2025-05-07T20:32:04.2768748Z 2025-05-07T20:32:04.2768938Z if scale_ub is not None: 2025-05-07T20:32:04.2769210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.2769560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.2769880Z ) 2025-05-07T20:32:04.2770068Z else: 2025-05-07T20:32:04.2770279Z scale_ub_tensor = None 

    def fn() -> Tuple[torch.Tensor, torch.Tensor]:
        op = silu_mul_quant
        if compiled:
            op = torch.compile(op)
        return op(x0, x1, scale_ub_tensor)

    y_fp8, y_scale = fn()
    y = y_fp8.to(torch.float32) * y_scale[:, None]

    def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
        return triton_quantize_fp8_row(y, scale_ub_tensor)

>   y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[... traceback identical to the first failure above (triton_quantize_fp8_row -> _kernel_quantize_fp8_row -> Triton autotuner -> compile) ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
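The computation under test is small: SiLU(x0) * x1 in float32, then row-wise FP8 quantization with an optional per-row scale upper bound. A plain-PyTorch sketch of that reference path follows; the row-wise contract is inferred from the test's dequant step (y_fp8.to(torch.float32) * y_scale[:, None]), not taken from triton_quantize_fp8_row itself, the FP8 max constant is an assumption, and torch.float8_e4m3fn requires a recent PyTorch:

    # Sketch of the reference path under the assumptions stated above.
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn (assumed)

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        x0_fp32 = x0.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)
        amax = y.abs().amax(dim=1).clamp(min=1e-12)  # per-row absolute max
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)     # honor the upper bound
        y_scale = amax / FP8_E4M3_MAX                # per-row dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale  # y ~ y_fp8.to(torch.float32) * y_scale[:, None]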
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.8096174Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:04.8097518Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.8098850Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:04.8099987Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:04.8101111Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:04.8102455Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.8103872Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.8105096Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:04.8106237Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:04.8107539Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.8109148Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.8110475Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.8111467Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.8112275Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:04.8113386Z W0507 20:32:04.804106 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.0006322Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.0007511Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.0009003Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.0010578Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.0012118Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.0013942Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.0015393Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.0016911Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.0018475Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.0019861Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.0021211Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.0022541Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.0023677Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:05.0024786Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:05.0026268Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.0027677Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.0028897Z W0507 
20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.0030959Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.0032245Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.0033750Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.0034905Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.0035890Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.0036684Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.0037799Z W0507 20:32:04.996531 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5111089Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.5113386Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.5116320Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.5118805Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.5120338Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.5122201Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5123641Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.5125159Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5126716Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.5128378Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.5129721Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.5131049Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.5132188Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:05.5133314Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:05.5134659Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.5136066Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.5137277Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.5138416Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.5139708Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.5141357Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.5142517Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5143501Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5144299Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.5145411Z W0507 20:32:05.506984 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5512796Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.5514132Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.5515594Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.5517173Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.5518788Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.5520533Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5521978Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.5523493Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5525063Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.5526439Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.5527838Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.5529170Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.5530577Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:05.5531700Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:05.5533207Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.5534625Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.5535847Z W0507 
20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.5536984Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.5538284Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.5539790Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.5540953Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5541945Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5542738Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.5543855Z W0507 20:32:05.547371 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.9900297Z self = 2025-05-07T20:32:05.9900991Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.9901281Z 2025-05-07T20:32:05.9901359Z @given( 2025-05-07T20:32:05.9901861Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.9902193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.9902506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.9902853Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.9903193Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.9903488Z ) 2025-05-07T20:32:05.9903848Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.9904317Z def test_silu_mul_quant( 2025-05-07T20:32:05.9904564Z self, 2025-05-07T20:32:05.9904755Z T: int, 2025-05-07T20:32:05.9904961Z D: int, 2025-05-07T20:32:05.9905184Z scale_ub: Optional[float], 2025-05-07T20:32:05.9905458Z contiguous: bool, 2025-05-07T20:32:05.9905705Z compiled: bool, 2025-05-07T20:32:05.9905939Z ) -> None: 2025-05-07T20:32:05.9906153Z torch.manual_seed(2025) 2025-05-07T20:32:05.9906401Z 2025-05-07T20:32:05.9906692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.9907049Z 2025-05-07T20:32:05.9907246Z x_sign = torch.sign(x) 2025-05-07T20:32:05.9907555Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.9907915Z x = x_sign * x_clamp 2025-05-07T20:32:05.9908165Z x0 = x[:, :D] 2025-05-07T20:32:05.9908387Z x1 = x[:, D:] 2025-05-07T20:32:05.9908601Z 2025-05-07T20:32:05.9908782Z if contiguous: 2025-05-07T20:32:05.9909020Z x0 = x0.contiguous() 2025-05-07T20:32:05.9909286Z x1 = x1.contiguous() 2025-05-07T20:32:05.9909531Z 2025-05-07T20:32:05.9909901Z if scale_ub is not None: 2025-05-07T20:32:05.9910189Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.9910723Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.9911042Z ) 2025-05-07T20:32:05.9911236Z else: 2025-05-07T20:32:05.9911455Z scale_ub_tensor = None 
2025-05-07T20:32:05.9911715Z 2025-05-07T20:32:05.9911954Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.9912276Z op = silu_mul_quant 2025-05-07T20:32:05.9912532Z if compiled: 2025-05-07T20:32:05.9912789Z op = torch.compile(op) 2025-05-07T20:32:05.9913087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.9913376Z 2025-05-07T20:32:05.9913569Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.9913859Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.9914152Z 2025-05-07T20:32:05.9914391Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.9914739Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.9915043Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.9915370Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.9915751Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.9916176Z 2025-05-07T20:32:05.9916415Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.9916666Z 2025-05-07T20:32:05.9916775Z moe/activation_test.py:126: 2025-05-07T20:32:05.9917081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.9917441Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.9917784Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.9918639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.9919455Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.9920038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.9920779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.9921678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.9922450Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.9923264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.9924069Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.9924850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.9925528Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.9926169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.9926725Z fn() 2025-05-07T20:32:05.9927260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.9927892Z self.fn.run( 2025-05-07T20:32:05.9928391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.9928959Z kernel = self.compile( 2025-05-07T20:32:05.9929533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.9930236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.9930654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.9930903Z 2025-05-07T20:32:05.9931118Z self = 2025-05-07T20:32:05.9932299Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.9933918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7db5ea60>} 2025-05-07T20:32:05.9935387Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.9936498Z context = 2025-05-07T20:32:05.9936802Z 2025-05-07T20:32:05.9936971Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.9937521Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.9938031Z module_map=module_map) 2025-05-07T20:32:05.9938447Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.9938816Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.9939089Z E ^ 2025-05-07T20:32:05.9939581Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.9940067Z 2025-05-07T20:32:05.9940512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.9941073Z 2025-05-07T20:32:05.9941174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.9941597Z self=, 2025-05-07T20:32:05.9942016Z T=4096, 2025-05-07T20:32:05.9942197Z D=5120, 2025-05-07T20:32:05.9942387Z scale_ub=None, 2025-05-07T20:32:05.9942599Z contiguous=True, 2025-05-07T20:32:05.9942814Z compiled=True, 2025-05-07T20:32:05.9943022Z ) 2025-05-07T20:32:06.5234940Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.5236132Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:06.5237616Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.5239254Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.5240783Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.5242319Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.5243760Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.5245277Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.5246839Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] 
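The "Trying example:" lines come from @settings(verbosity=Verbosity.verbose): Hypothesis prints each generated argument tuple before running it, and here every tuple hits the same compile error. To replay one known-failing case deterministically instead of re-sampling, Hypothesis's @example decorator can pin inputs; a sketch with the test body elided and the test name hypothetical:

    # Sketch: pin one failing tuple so it reruns on every invocation.
    from hypothesis import example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=2048, D=5120)  # first failing tuple shown in this log
    @settings(max_examples=5, deadline=None)
    def test_repro(T: int, D: int) -> None:
        ...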
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.5248368Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:06.5249702Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.5251033Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:06.5252165Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:06.5253280Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:06.5254625Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.5256028Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.5257246Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.5258385Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:06.5259679Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.5261260Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.5262414Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.5263404Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.5264208Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:06.5265320Z W0507 20:32:06.519389 86873 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") [... the identical W0507 "Encountered an exception in identify_mutated_tensors, assuming every input is mutated" warning, with the same _fbgemm_silu_mul_quant CompilationError traceback, was emitted three more times at 20:32:06.709325, 20:32:07.219366, and 20:32:07.259408; duplicates elided ...]
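[note: root cause of the repeated failures above. Triton's fp8e4nv is the NVIDIA E4M3 FP8 format, which Triton only lowers on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job runs on linux.g5.4xlarge, whose NVIDIA A10G is SM 8.6 (Ampere); there Triton exposes only fp8e4b15 and fp8e5, so every kernel that casts to tl.float8e4nv fails during ast_to_ttir, before autotuning even starts. Below is a minimal sketch of a capability guard that would skip these cases on unsupported hardware; the helper name supports_fp8e4nv and its placement are assumptions, not part of the test suite.]

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        """True when Triton can lower fp8e4nv (FP8 E4M3) on this GPU."""
        if not torch.cuda.is_available():
            return False
        # Triton requires SM 8.9+ for E4M3; the A10G on g5 runners is SM 8.6.
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 needs SM 8.9+")
    # def test_silu_mul_quant(self, ...): ...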
2025-05-07T20:32:07.7155715Z self = 2025-05-07T20:32:07.7156311Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.7156723Z 2025-05-07T20:32:07.7156873Z @given( 2025-05-07T20:32:07.7157185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.7157598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.7158314Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.7158666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.7159013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.7159307Z ) 2025-05-07T20:32:07.7159677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.7160152Z def test_silu_mul_quant( 2025-05-07T20:32:07.7160398Z self, 2025-05-07T20:32:07.7160596Z T: int, 2025-05-07T20:32:07.7160795Z D: int, 2025-05-07T20:32:07.7161010Z scale_ub: Optional[float], 2025-05-07T20:32:07.7161291Z contiguous: bool, 2025-05-07T20:32:07.7161533Z compiled: bool, 2025-05-07T20:32:07.7161766Z ) -> None: 2025-05-07T20:32:07.7161987Z torch.manual_seed(2025) 2025-05-07T20:32:07.7162241Z 2025-05-07T20:32:07.7162510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.7162875Z 2025-05-07T20:32:07.7163077Z x_sign = torch.sign(x) 2025-05-07T20:32:07.7163375Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.7163689Z x = x_sign * x_clamp 2025-05-07T20:32:07.7163934Z x0 = x[:, :D] 2025-05-07T20:32:07.7164154Z x1 = x[:, D:] 2025-05-07T20:32:07.7164359Z 2025-05-07T20:32:07.7164544Z if contiguous: 2025-05-07T20:32:07.7164782Z x0 = x0.contiguous() 2025-05-07T20:32:07.7165039Z x1 = x1.contiguous() 2025-05-07T20:32:07.7165284Z 2025-05-07T20:32:07.7165480Z if scale_ub is not None: 2025-05-07T20:32:07.7165752Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.7166097Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.7166419Z ) 2025-05-07T20:32:07.7166768Z else: 2025-05-07T20:32:07.7166981Z scale_ub_tensor = None
2025-05-07T20:32:07.7167245Z 2025-05-07T20:32:07.7167481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.7167818Z op = silu_mul_quant 2025-05-07T20:32:07.7168083Z if compiled: 2025-05-07T20:32:07.7168336Z op = torch.compile(op) 2025-05-07T20:32:07.7168651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.7168947Z 2025-05-07T20:32:07.7169150Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.7169442Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.7169753Z 2025-05-07T20:32:07.7170004Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.7170348Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.7170656Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.7170989Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.7171367Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.7171704Z 2025-05-07T20:32:07.7171911Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.7172119Z 2025-05-07T20:32:07.7172236Z moe/activation_test.py:126: 2025-05-07T20:32:07.7172540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.7172900Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.7173246Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.7174105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.7174934Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.7175525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.7176270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.7177016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.7177918Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.7178786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:07.7179587Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.7180372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.7181061Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.7181705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.7182255Z fn() 2025-05-07T20:32:07.7183125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.7183759Z self.fn.run( 2025-05-07T20:32:07.7184255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.7184817Z kernel = self.compile( 2025-05-07T20:32:07.7185389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.7186087Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.7186495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.7186745Z 2025-05-07T20:32:07.7186962Z self = 2025-05-07T20:32:07.7188131Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.7189952Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7d2f1700>} 2025-05-07T20:32:07.7191422Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.7192522Z context = 2025-05-07T20:32:07.7192836Z 2025-05-07T20:32:07.7193007Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.7193557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.7194048Z module_map=module_map) 2025-05-07T20:32:07.7194418Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.7194788Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.7195061Z E ^ 2025-05-07T20:32:07.7195550Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.7196042Z 2025-05-07T20:32:07.7196490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.7197044Z 2025-05-07T20:32:07.7197153Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.7197581Z self=, 2025-05-07T20:32:07.7197996Z T=16384, 2025-05-07T20:32:07.7198191Z D=5120, 2025-05-07T20:32:07.7198385Z scale_ub=None, 2025-05-07T20:32:07.7198589Z contiguous=True, 2025-05-07T20:32:07.7198808Z compiled=True, 2025-05-07T20:32:07.7199011Z ) 2025-05-07T20:32:07.7661385Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:07.7663059Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:07.7664534Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:07.7665606Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:07.7666812Z W0507 20:32:07.764492 86873 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
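[note: separate from the compile errors, the run also exhausts torch._dynamo's recompile budget here. fn() wraps silu_mul_quant in torch.compile for every Hypothesis example, and the contiguous sweep flips x0's row stride between D (5120, after .contiguous()) and 2*D (10240, as a view of x); each flip fails a stride guard and forces a recompile until config.recompile_limit (8) is hit, after which dynamo falls back to eager for that frame. A hedged sketch of the usual knobs follows; the values are illustrative, and dynamic=True reduces rather than guarantees zero recompiles.]

    import torch
    import torch._dynamo

    # Raise the per-frame recompile budget (the warning above shows the default, 8).
    torch._dynamo.config.recompile_limit = 32

    # Or compile once with dynamic shapes so one graph can serve both layouts:
    # op = torch.compile(silu_mul_quant, dynamic=True)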
2025-05-07T20:32:07.8889628Z self = 2025-05-07T20:32:07.8890161Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True [... test source listing and ref_fn() traceback identical to the T=4096 failure above: triton_quantize_fp8_row -> _kernel_quantize_fp8_row fails in src.make_ir with the same fp8e4nv CompilationError (triton/compiler/compiler.py:100); duplicates elided ...]
2025-05-07T20:32:07.8929975Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.8930395Z self=, 2025-05-07T20:32:07.8930812Z T=1, 2025-05-07T20:32:07.8930990Z D=5120, 2025-05-07T20:32:07.8931186Z scale_ub=1200.0, 2025-05-07T20:32:07.8931404Z contiguous=True, 2025-05-07T20:32:07.8931626Z compiled=True, 2025-05-07T20:32:07.8931832Z ) 2025-05-07T20:32:08.2742564Z self = 2025-05-07T20:32:08.2743485Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True [... test source listing identical to the T=4096 example above; elided ...] 2025-05-07T20:32:08.2763171Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.2763618Z moe/activation_test.py:117: 2025-05-07T20:32:08.2764121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.2764694Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.2765187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.2766149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.2767905Z return fn(*args, **kwargs) 2025-05-07T20:32:08.2769099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.2770280Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.2771149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.2772296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.2773464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.2774418Z kernel = self.compile( 2025-05-07T20:32:08.2775368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.2776537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.2777238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ [... locals (self, options=CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, ...), codegen_fns, module_map, context) elided ...] 2025-05-07T20:32:08.2788046Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.2789015Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.2789951Z module_map=module_map) 2025-05-07T20:32:08.2790558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.2791137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.2791566Z E ^ 2025-05-07T20:32:08.2792366Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.2793929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
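[note: both failure paths bottom out in the same frame. The reference path (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row) and the op under test (fn -> silu_mul_quant -> _fbgemm_silu_mul_quant) each raise inside src.make_ir, i.e. while translating the kernel's Python AST to TTIR, so the error is independent of T, D, scale_ub, contiguity, and torch.compile. A minimal repro sketch of the same error on a pre-SM-8.9 GPU; the kernel and tensor names are illustrative, not taken from FBGEMM.]

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_cast_repro(x_ptr, y_ptr, N: tl.constexpr):
        offs = tl.arange(0, N)
        x = tl.load(x_ptr + offs)
        # The cast below is what trips "type fp8e4nv not supported in this
        # architecture" at compile time on SM < 8.9 (e.g. the A10G).
        tl.store(y_ptr + offs, x.to(tl.float8e4nv))

    x = torch.randn(16, device="cuda")
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_repro[(1,)](x, y, N=16)  # raises triton.compiler.errors.CompilationError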
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.2793187Z 2025-05-07T20:32:08.2793929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.2794872Z 2025-05-07T20:32:08.2795045Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.2795758Z self=, 2025-05-07T20:32:08.2796447Z T=1, 2025-05-07T20:32:08.2796760Z D=5120, 2025-05-07T20:32:08.2797073Z scale_ub=None, 2025-05-07T20:32:08.2797426Z contiguous=False, 2025-05-07T20:32:08.2797803Z compiled=True, 2025-05-07T20:32:08.2798137Z ) 2025-05-07T20:32:08.3609195Z self = 2025-05-07T20:32:08.3610081Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.3610521Z 2025-05-07T20:32:08.3610648Z @given( 2025-05-07T20:32:08.3611017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.3611544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.3612032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.3612608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.3613151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.3613530Z ) 2025-05-07T20:32:08.3614325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.3615030Z def test_silu_mul_quant( 2025-05-07T20:32:08.3615395Z self, 2025-05-07T20:32:08.3615701Z T: int, 2025-05-07T20:32:08.3616007Z D: int, 2025-05-07T20:32:08.3616356Z scale_ub: Optional[float], 2025-05-07T20:32:08.3616775Z contiguous: bool, 2025-05-07T20:32:08.3617157Z compiled: bool, 2025-05-07T20:32:08.3617529Z ) -> None: 2025-05-07T20:32:08.3617882Z torch.manual_seed(2025) 2025-05-07T20:32:08.3618289Z 2025-05-07T20:32:08.3618748Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.3619321Z 2025-05-07T20:32:08.3619642Z x_sign = torch.sign(x) 2025-05-07T20:32:08.3620135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.3620667Z x = x_sign * x_clamp 2025-05-07T20:32:08.3621073Z x0 = x[:, :D] 2025-05-07T20:32:08.3621435Z x1 = x[:, D:] 2025-05-07T20:32:08.3621780Z 2025-05-07T20:32:08.3622086Z if contiguous: 2025-05-07T20:32:08.3634089Z x0 = x0.contiguous() 2025-05-07T20:32:08.3634532Z x1 = x1.contiguous() 2025-05-07T20:32:08.3634928Z 2025-05-07T20:32:08.3635240Z if scale_ub is not None: 2025-05-07T20:32:08.3635690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.3636271Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.3636793Z ) 2025-05-07T20:32:08.3637118Z else: 2025-05-07T20:32:08.3637460Z scale_ub_tensor = None 2025-05-07T20:32:08.3637886Z 2025-05-07T20:32:08.3638275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3638811Z op = silu_mul_quant 2025-05-07T20:32:08.3639520Z if compiled: 2025-05-07T20:32:08.3639943Z op = torch.compile(op) 2025-05-07T20:32:08.3640436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.3640910Z 2025-05-07T20:32:08.3641240Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.3641713Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.3642212Z 2025-05-07T20:32:08.3642606Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3643178Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.3643683Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.3644225Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.3644841Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.3645367Z 2025-05-07T20:32:08.3645701Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:08.3646044Z 2025-05-07T20:32:08.3646218Z moe/activation_test.py:126: 2025-05-07T20:32:08.3646733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3647312Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.3647876Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.3649316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.3650667Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.3651617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.3652784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.3654005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.3655302Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.3656650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.3658144Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.3659502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.3660642Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.3661707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.3662625Z fn() 2025-05-07T20:32:08.3663525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.3664567Z self.fn.run( 2025-05-07T20:32:08.3665378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.3666327Z kernel = self.compile( 2025-05-07T20:32:08.3667272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.3668434Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3669118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3669510Z 2025-05-07T20:32:08.3669957Z self = 2025-05-07T20:32:08.3671867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.3674341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1d7cdb9af0>} 2025-05-07T20:32:08.3676698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.3678634Z context = 2025-05-07T20:32:08.3679150Z 2025-05-07T20:32:08.3679425Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.3680312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3681118Z module_map=module_map) 2025-05-07T20:32:08.3681710Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3682299Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.3682741Z E ^ 2025-05-07T20:32:08.3683983Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3684793Z 2025-05-07T20:32:08.3685535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.3686454Z 2025-05-07T20:32:08.3686622Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.3687331Z self=, 2025-05-07T20:32:08.3688015Z T=1, 2025-05-07T20:32:08.3688315Z D=5120, 2025-05-07T20:32:08.3688628Z scale_ub=None, 2025-05-07T20:32:08.3688964Z contiguous=True, 2025-05-07T20:32:08.3689325Z compiled=False, 2025-05-07T20:32:08.3689655Z ) 2025-05-07T20:32:08.5647655Z self = 2025-05-07T20:32:08.5648558Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:08.5649003Z 2025-05-07T20:32:08.5649129Z @given( 2025-05-07T20:32:08.5649501Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5650008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5650530Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5651069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5651914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5652318Z ) 2025-05-07T20:32:08.5652829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5653522Z def test_silu_mul_quant( 2025-05-07T20:32:08.5653890Z self, 2025-05-07T20:32:08.5654191Z T: int, 2025-05-07T20:32:08.5654502Z D: int, 2025-05-07T20:32:08.5654835Z scale_ub: Optional[float], 2025-05-07T20:32:08.5655253Z contiguous: bool, 2025-05-07T20:32:08.5655640Z compiled: bool, 2025-05-07T20:32:08.5656003Z ) -> None: 2025-05-07T20:32:08.5656364Z torch.manual_seed(2025) 2025-05-07T20:32:08.5656762Z 2025-05-07T20:32:08.5657181Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5657750Z 2025-05-07T20:32:08.5658057Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5658520Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5659036Z x = x_sign * x_clamp 2025-05-07T20:32:08.5659432Z x0 = x[:, :D] 2025-05-07T20:32:08.5659765Z x1 = x[:, D:] 2025-05-07T20:32:08.5660099Z 2025-05-07T20:32:08.5660394Z if contiguous: 2025-05-07T20:32:08.5660756Z x0 = x0.contiguous() 2025-05-07T20:32:08.5661180Z x1 = x1.contiguous() 2025-05-07T20:32:08.5661570Z 2025-05-07T20:32:08.5661876Z if scale_ub is not None: 2025-05-07T20:32:08.5662319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5662865Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5663368Z ) 2025-05-07T20:32:08.5663670Z else: 2025-05-07T20:32:08.5664010Z scale_ub_tensor = None 2025-05-07T20:32:08.5664420Z 2025-05-07T20:32:08.5664779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5665559Z op = silu_mul_quant 2025-05-07T20:32:08.5665964Z if compiled: 2025-05-07T20:32:08.5666356Z op 
= torch.compile(op) 2025-05-07T20:32:08.5666847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5667301Z 2025-05-07T20:32:08.5667599Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5667876Z 2025-05-07T20:32:08.5668037Z moe/activation_test.py:117: 2025-05-07T20:32:08.5668525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5669125Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5669579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5670935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.5672144Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.5673051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.5674253Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.5675416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.5676330Z kernel = self.compile( 2025-05-07T20:32:08.5677255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.5678382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5679060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5679459Z 2025-05-07T20:32:08.5679799Z self = 2025-05-07T20:32:08.5681648Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.5684596Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7d2f18b0>} 2025-05-07T20:32:08.5686964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.5688775Z context = 2025-05-07T20:32:08.5689278Z 2025-05-07T20:32:08.5689550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.5690448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5691258Z module_map=module_map) 2025-05-07T20:32:08.5691859Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5692432Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5692846Z E ^ 2025-05-07T20:32:08.5693645Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5694447Z 2025-05-07T20:32:08.5695181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5696103Z 2025-05-07T20:32:08.5696272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.5696968Z self=, 2025-05-07T20:32:08.5697649Z T=128, 2025-05-07T20:32:08.5697937Z D=5120, 2025-05-07T20:32:08.5698242Z scale_ub=None, 2025-05-07T20:32:08.5698588Z contiguous=False, 2025-05-07T20:32:08.5698939Z compiled=True, 2025-05-07T20:32:08.5699266Z ) 2025-05-07T20:32:08.5699821Z self = 2025-05-07T20:32:08.5700852Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.5701323Z 2025-05-07T20:32:08.5701447Z @given( 2025-05-07T20:32:08.5701826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5702344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5702845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5703395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5703940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5704407Z ) 2025-05-07T20:32:08.5704994Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5705729Z def test_silu_mul_quant( 2025-05-07T20:32:08.5706107Z self, 2025-05-07T20:32:08.5706417Z T: int, 2025-05-07T20:32:08.5706730Z D: int, 2025-05-07T20:32:08.5707070Z scale_ub: Optional[float], 2025-05-07T20:32:08.5707510Z contiguous: bool, 2025-05-07T20:32:08.5707913Z compiled: bool, 2025-05-07T20:32:08.5708259Z ) -> None: 2025-05-07T20:32:08.5708619Z torch.manual_seed(2025) 2025-05-07T20:32:08.5709012Z 2025-05-07T20:32:08.5709499Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5710213Z 2025-05-07T20:32:08.5710513Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5710987Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5711507Z x = x_sign * x_clamp 2025-05-07T20:32:08.5711889Z x0 = x[:, :D] 2025-05-07T20:32:08.5712226Z x1 = x[:, D:] 2025-05-07T20:32:08.5712564Z 2025-05-07T20:32:08.5712848Z if contiguous: 2025-05-07T20:32:08.5713219Z x0 = x0.contiguous() 2025-05-07T20:32:08.5713640Z x1 = x1.contiguous() 2025-05-07T20:32:08.5714032Z 2025-05-07T20:32:08.5714331Z if scale_ub is not None: 2025-05-07T20:32:08.5714781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5715340Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5715844Z ) 2025-05-07T20:32:08.5716151Z else: 2025-05-07T20:32:08.5716653Z scale_ub_tensor = None 2025-05-07T20:32:08.5717059Z 2025-05-07T20:32:08.5717426Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5717946Z op = silu_mul_quant 2025-05-07T20:32:08.5718340Z if compiled: 2025-05-07T20:32:08.5718738Z op = torch.compile(op) 2025-05-07T20:32:08.5719219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5719656Z 2025-05-07T20:32:08.5719955Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5720224Z 2025-05-07T20:32:08.5720384Z moe/activation_test.py:117: 2025-05-07T20:32:08.5720865Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5721409Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5721870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5722831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.5723799Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.5724930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:08.5726043Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.5726947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.5728130Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.5729294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.5730220Z     kernel = self.compile(
2025-05-07T20:32:08.5731143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.5732433Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.5733111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.5743072Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.5743962Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.5744764Z                            module_map=module_map)
2025-05-07T20:32:08.5745361Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.5745933Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.5746368Z E   ^
2025-05-07T20:32:08.5747162Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.5748750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.5749934Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:08.7300304Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.7300746Z moe/activation_test.py:117:
2025-05-07T20:32:08.7301239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.7301784Z moe/activation_test.py:115: in fn
2025-05-07T20:32:08.7302247Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.7303447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:08.7304649Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.7305574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.7306773Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.7307939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.7308919Z     kernel = self.compile(
2025-05-07T20:32:08.7310230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.7311393Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.7312056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:08.7322073Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.7322971Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.7323778Z                            module_map=module_map)
2025-05-07T20:32:08.7324354Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.7324929Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.7325352Z E   ^
2025-05-07T20:32:08.7326291Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.7327829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
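Every failure in this job reduces to the same root cause: Triton lowers torch.float8_e4m3fn to its fp8e4nv type, and this Triton build only compiles fp8e4nv for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper class). The linux.g5.4xlarge runner carries an A10G, which reports compute capability (8, 6), so AST-to-TTIR conversion rejects the kernel with the ValueError shown above. A minimal capability guard would look roughly like the sketch below (illustrative only; fp8e4nv_supported is not an FBGEMM function):

import torch

def fp8e4nv_supported() -> bool:
    # Triton's fp8e4nv is torch.float8_e4m3fn; in this Triton version it only
    # lowers on NVIDIA GPUs with compute capability >= (8, 9).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# e.g. with unittest: @unittest.skipIf(not fp8e4nv_supported(), "needs SM 8.9+")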
2025-05-07T20:32:08.7328901Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:08.7352314Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.7352750Z moe/activation_test.py:117:
2025-05-07T20:32:08.7388594Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.7392244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.7393323Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:08.9715361Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.9715633Z moe/activation_test.py:117:
2025-05-07T20:32:08.9730234Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9732291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
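The compiled=True examples that follow fail identically to the eager ones; the only difference in their tracebacks is an extra torch/_dynamo/eval_frame.py frame. torch.compile merely wraps the Python call, while the Triton kernel inside silu_mul_quant is still JIT-compiled at first launch, so the same fp8e4nv rejection fires either way. Roughly (the import path is inferred from the traceback above, so treat it as an assumption):

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Wrapping in torch.compile only adds a dynamo frame; the underlying
# _fbgemm_silu_mul_quant Triton kernel is compiled at launch in both cases.
op = torch.compile(silu_mul_quant)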
2025-05-07T20:32:08.9732966Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:08.9747751Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.9748027Z moe/activation_test.py:117:
2025-05-07T20:32:08.9763677Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.9765746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.9766409Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:09.3585600Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:09.3585873Z moe/activation_test.py:117:
2025-05-07T20:32:09.3601608Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.3603666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
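Any of these Hypothesis examples can be replayed outside the test harness by pasting the reported arguments into a direct call; a sketch for the T=1, D=7168, scale_ub=1200.0 case follows (same caveat about the import path; on this runner the call raises the CompilationError rather than returning):

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 7168  # values taken from the failing example above
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
scale_ub_tensor = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
# Raises triton.compiler.errors.CompilationError on SM 8.6 GPUs:
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], scale_ub_tensor)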
2025-05-07T20:32:09.3604336Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:09.4771980Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:09.4784676Z         y_fp8, y_scale = fn()
2025-05-07T20:32:09.4784969Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:09.4785511Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.4785858Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:09.4786162Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:09.4786488Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:09.4786859Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.4787384Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:09.4787693Z moe/activation_test.py:126:
2025-05-07T20:32:09.4787998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.4788356Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:09.4788698Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.4789878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:09.4790722Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:09.4792823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:09.4793618Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:09.4796071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:09.4796770Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:09.4797424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:09.4797985Z     fn()
2025-05-07T20:32:09.4799659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:09.4800235Z     kernel = self.compile(
2025-05-07T20:32:09.4800938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:09.4801660Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:09.4809910Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.4810284Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:09.4810560Z E   ^
2025-05-07T20:32:09.4812270Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.4813239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
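This example is the one variant worth reading in full: here the recorded failure is not in fn() but in the test's reference path, because ref_fn calls triton_quantize_fp8_row and its _kernel_quantize_fp8_row kernel trips the same fp8e4nv limitation. A reference that avoids Triton entirely can be written in plain PyTorch; the sketch below shows the general rowwise-quantization technique (my own approximation, not fbgemm_gpu's triton_quantize_fp8_row, and it still requires a PyTorch build with float8 dtypes):

from typing import Optional, Tuple
import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor]
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Rowwise scheme: scale = row_max / fp8_max, quantize as y / scale,
    # dequantize later as y_fp8.float() * scale[:, None].
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale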
2025-05-07T20:32:09.6848587Z op = torch.compile(op) 2025-05-07T20:32:09.6848896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.6849186Z 2025-05-07T20:32:09.6849384Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.6849555Z 2025-05-07T20:32:09.6849655Z moe/activation_test.py:117: 2025-05-07T20:32:09.6849965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6850314Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.6850597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.6851195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.6851801Z return fn(*args, **kwargs) 2025-05-07T20:32:09.6852518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.6853265Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.6853983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.6854723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.6855437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.6856006Z kernel = self.compile( 2025-05-07T20:32:09.6856586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.6857289Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.6857702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6857956Z 2025-05-07T20:32:09.6858171Z self = 2025-05-07T20:32:09.6859357Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.6860882Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7c188e50>} 2025-05-07T20:32:09.6862356Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.6863462Z context = 2025-05-07T20:32:09.6863775Z 2025-05-07T20:32:09.6863946Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.6864498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.6865084Z module_map=module_map) 2025-05-07T20:32:09.6865460Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.6865835Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.6866106Z E ^ 2025-05-07T20:32:09.6866593Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.6867091Z 2025-05-07T20:32:09.6867541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.6868104Z 2025-05-07T20:32:09.6868214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.6868649Z self=, 2025-05-07T20:32:09.6869120Z T=1, 2025-05-07T20:32:09.6869312Z D=5120, 2025-05-07T20:32:09.6869510Z scale_ub=1200.0, 2025-05-07T20:32:09.6869979Z contiguous=False, 2025-05-07T20:32:09.6870208Z compiled=False, 2025-05-07T20:32:09.6870425Z ) 2025-05-07T20:32:09.6870754Z self = 2025-05-07T20:32:09.6871274Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.6871567Z 2025-05-07T20:32:09.6871647Z @given( 2025-05-07T20:32:09.6871881Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.6872200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.6872518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.6872861Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.6873205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.6873498Z ) 2025-05-07T20:32:09.6873863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.6874336Z def test_silu_mul_quant( 2025-05-07T20:32:09.6874579Z self, 2025-05-07T20:32:09.6874788Z T: int, 2025-05-07T20:32:09.6874994Z D: int, 2025-05-07T20:32:09.6875208Z scale_ub: Optional[float], 2025-05-07T20:32:09.6875492Z contiguous: bool, 2025-05-07T20:32:09.6875833Z compiled: bool, 2025-05-07T20:32:09.6876059Z ) -> None: 2025-05-07T20:32:09.6876280Z torch.manual_seed(2025) 2025-05-07T20:32:09.6876535Z 2025-05-07T20:32:09.6876805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.6877172Z 2025-05-07T20:32:09.6877368Z x_sign = torch.sign(x) 2025-05-07T20:32:09.6877664Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.6877987Z x = x_sign * x_clamp 2025-05-07T20:32:09.6878231Z x0 = x[:, :D] 2025-05-07T20:32:09.6878458Z x1 = x[:, D:] 2025-05-07T20:32:09.6878667Z 2025-05-07T20:32:09.6878858Z if contiguous: 2025-05-07T20:32:09.6879095Z x0 = x0.contiguous() 2025-05-07T20:32:09.6879355Z x1 = x1.contiguous() 2025-05-07T20:32:09.6879608Z 2025-05-07T20:32:09.6879807Z if scale_ub is not None: 2025-05-07T20:32:09.6880085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.6880442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.6880775Z ) 2025-05-07T20:32:09.6880965Z else: 2025-05-07T20:32:09.6881184Z scale_ub_tensor = None 2025-05-07T20:32:09.6881449Z 2025-05-07T20:32:09.6881678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.6882014Z op = silu_mul_quant 2025-05-07T20:32:09.6882275Z if compiled: 2025-05-07T20:32:09.6882525Z op = torch.compile(op) 2025-05-07T20:32:09.6883126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.6883415Z 2025-05-07T20:32:09.6883617Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.6883784Z 2025-05-07T20:32:09.6883880Z moe/activation_test.py:117: 2025-05-07T20:32:09.6884315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6884659Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.6884942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.6885685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.6886428Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.6886995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.6887723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.6888447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.6889050Z kernel = self.compile( 2025-05-07T20:32:09.6889616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.6890323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.6890735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6890981Z 2025-05-07T20:32:09.6891199Z self = 2025-05-07T20:32:09.6892361Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.6893866Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cefb79820>} 2025-05-07T20:32:09.6895336Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.6896452Z context = 2025-05-07T20:32:09.6896756Z 2025-05-07T20:32:09.6897050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.6897596Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.6898088Z module_map=module_map) 2025-05-07T20:32:09.6898467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.6898825Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.6899089Z E ^ 2025-05-07T20:32:09.6899584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.6900072Z 2025-05-07T20:32:09.6900526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.6901086Z 2025-05-07T20:32:09.6901187Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.6901620Z self=, 2025-05-07T20:32:09.6902042Z T=16384, 2025-05-07T20:32:09.6902241Z D=5120, 2025-05-07T20:32:09.6902439Z scale_ub=1200.0, 2025-05-07T20:32:09.6902662Z contiguous=False, 2025-05-07T20:32:09.6902883Z compiled=True, 2025-05-07T20:32:09.6903091Z ) 2025-05-07T20:32:09.8110100Z self = 2025-05-07T20:32:09.8110860Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.8111281Z 2025-05-07T20:32:09.8111394Z @given( 2025-05-07T20:32:09.8111715Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8112160Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8112588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8113047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8113829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8114204Z ) 2025-05-07T20:32:09.8114663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8115131Z def test_silu_mul_quant( 2025-05-07T20:32:09.8115379Z self, 2025-05-07T20:32:09.8115567Z T: int, 2025-05-07T20:32:09.8115767Z D: int, 2025-05-07T20:32:09.8115984Z scale_ub: Optional[float], 2025-05-07T20:32:09.8116254Z contiguous: bool, 2025-05-07T20:32:09.8116502Z compiled: bool, 2025-05-07T20:32:09.8116749Z ) -> None: 2025-05-07T20:32:09.8116970Z torch.manual_seed(2025) 2025-05-07T20:32:09.8117222Z 2025-05-07T20:32:09.8117498Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8117897Z 2025-05-07T20:32:09.8118088Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8118390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8118722Z x = x_sign * x_clamp 2025-05-07T20:32:09.8118966Z x0 = x[:, :D] 2025-05-07T20:32:09.8119190Z x1 = x[:, D:] 2025-05-07T20:32:09.8119409Z 2025-05-07T20:32:09.8119598Z if contiguous: 2025-05-07T20:32:09.8119838Z x0 = x0.contiguous() 2025-05-07T20:32:09.8120110Z x1 = x1.contiguous() 2025-05-07T20:32:09.8120363Z 2025-05-07T20:32:09.8120557Z if scale_ub is not None: 2025-05-07T20:32:09.8120846Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8121193Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8121507Z ) 2025-05-07T20:32:09.8121697Z else: 2025-05-07T20:32:09.8121909Z scale_ub_tensor = None 2025-05-07T20:32:09.8122160Z 2025-05-07T20:32:09.8122390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8122714Z op = silu_mul_quant 2025-05-07T20:32:09.8122962Z if compiled: 2025-05-07T20:32:09.8123215Z op = torch.compile(op) 2025-05-07T20:32:09.8123521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8123799Z 2025-05-07T20:32:09.8123993Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.8124339Z 2025-05-07T20:32:09.8124447Z moe/activation_test.py:117: 2025-05-07T20:32:09.8124747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8125088Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.8125373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8125965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.8126560Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.8127267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.8128014Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.8128584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.8129310Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.8130021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.8130587Z kernel = self.compile( 2025-05-07T20:32:09.8131154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.8131861Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.8132278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8132519Z 2025-05-07T20:32:09.8132739Z self = 2025-05-07T20:32:09.8133900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.8135513Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7c137790>} 2025-05-07T20:32:09.8136981Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.8138092Z context = 2025-05-07T20:32:09.8138397Z 2025-05-07T20:32:09.8138574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.8139116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.8139611Z module_map=module_map) 2025-05-07T20:32:09.8139996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.8140353Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.8140632Z E ^ 2025-05-07T20:32:09.8141130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8141624Z 2025-05-07T20:32:09.8142083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.8142640Z 2025-05-07T20:32:09.8142744Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.8143180Z self=, 2025-05-07T20:32:09.8143613Z T=2048, 2025-05-07T20:32:09.8143795Z D=7168, 2025-05-07T20:32:09.8143988Z scale_ub=1200.0, 2025-05-07T20:32:09.8144213Z contiguous=False, 2025-05-07T20:32:09.8144433Z compiled=True, 2025-05-07T20:32:09.8144640Z ) 2025-05-07T20:32:09.8144967Z self = 2025-05-07T20:32:09.8145492Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.8145790Z 2025-05-07T20:32:09.8145949Z @given( 2025-05-07T20:32:09.8146184Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8146509Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8146821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8147169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8147514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8147804Z ) 2025-05-07T20:32:09.8148169Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8148639Z def test_silu_mul_quant( 2025-05-07T20:32:09.8148874Z self, 2025-05-07T20:32:09.8149064Z T: int, 2025-05-07T20:32:09.8149262Z D: int, 2025-05-07T20:32:09.8149481Z scale_ub: Optional[float], 2025-05-07T20:32:09.8149875Z contiguous: bool, 2025-05-07T20:32:09.8150120Z compiled: bool, 2025-05-07T20:32:09.8150350Z ) -> None: 2025-05-07T20:32:09.8150557Z torch.manual_seed(2025) 2025-05-07T20:32:09.8150803Z 2025-05-07T20:32:09.8151078Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8151429Z 2025-05-07T20:32:09.8151618Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8151915Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8152227Z x = x_sign * x_clamp 2025-05-07T20:32:09.8152478Z x0 = x[:, :D] 2025-05-07T20:32:09.8152703Z x1 = x[:, D:] 2025-05-07T20:32:09.8152914Z 2025-05-07T20:32:09.8153107Z if contiguous: 2025-05-07T20:32:09.8153347Z x0 = x0.contiguous() 2025-05-07T20:32:09.8153610Z x1 = x1.contiguous() 2025-05-07T20:32:09.8153868Z 2025-05-07T20:32:09.8154069Z if scale_ub is not None: 2025-05-07T20:32:09.8154430Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8154775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8155100Z ) 2025-05-07T20:32:09.8155298Z else: 2025-05-07T20:32:09.8155508Z scale_ub_tensor = None 2025-05-07T20:32:09.8155768Z 2025-05-07T20:32:09.8156002Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8156324Z op = silu_mul_quant 2025-05-07T20:32:09.8156579Z if compiled: 2025-05-07T20:32:09.8156829Z op = torch.compile(op) 2025-05-07T20:32:09.8157126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8157416Z 2025-05-07T20:32:09.8157618Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.8157794Z 2025-05-07T20:32:09.8157896Z moe/activation_test.py:117: 2025-05-07T20:32:09.8158206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8158564Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.8158860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8159446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.8160045Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.8160749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:09.8161483Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:09.8162056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:09.8162786Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:09.8163492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:09.8164054Z     kernel = self.compile(
2025-05-07T20:32:09.8164628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:09.8165334Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:09.8165831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.8166080Z 
2025-05-07T20:32:09.8166292Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:09.8167458Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:09.8168966Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1cefadf4c0>}
2025-05-07T20:32:09.8170440Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:09.8171549Z context = <...>
2025-05-07T20:32:09.8171867Z 
2025-05-07T20:32:09.8172035Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:09.8172582Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:09.8173071Z                            module_map=module_map)
2025-05-07T20:32:09.8173439Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.8173799Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.8174066Z E       ^
2025-05-07T20:32:09.8174551Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.8175042Z 
2025-05-07T20:32:09.8175489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.8176132Z 
2025-05-07T20:32:10.0873473Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.0874040Z     self=<...>,
2025-05-07T20:32:10.0874631Z     T=1,
2025-05-07T20:32:10.0874879Z     D=5120,
2025-05-07T20:32:10.0875082Z     scale_ub=None,
2025-05-07T20:32:10.0875307Z     contiguous=False,
2025-05-07T20:32:10.0875546Z     compiled=False,
2025-05-07T20:32:10.0875765Z )
2025-05-07T20:32:10.0876096Z self = <...>
2025-05-07T20:32:10.0876619Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:10.0876902Z 
2025-05-07T20:32:10.0876989Z     @given(
2025-05-07T20:32:10.0877224Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:10.0877555Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:10.0877880Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:10.0878240Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:10.0878584Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:10.0878897Z     )
2025-05-07T20:32:10.0879273Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:10.0879744Z     def test_silu_mul_quant(
2025-05-07T20:32:10.0880001Z         self,
2025-05-07T20:32:10.0880207Z         T: int,
2025-05-07T20:32:10.0880408Z         D: int,
2025-05-07T20:32:10.0880638Z         scale_ub: Optional[float],
2025-05-07T20:32:10.0880926Z         contiguous: bool,
2025-05-07T20:32:10.0881173Z         compiled: bool,
2025-05-07T20:32:10.0881417Z     ) -> None:
2025-05-07T20:32:10.0881642Z         torch.manual_seed(2025)
2025-05-07T20:32:10.0881896Z 
2025-05-07T20:32:10.0882180Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:10.0882554Z 
2025-05-07T20:32:10.0883057Z         x_sign = torch.sign(x)
2025-05-07T20:32:10.0883384Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.0883716Z         x = x_sign * x_clamp
2025-05-07T20:32:10.0883973Z         x0 = x[:, :D]
2025-05-07T20:32:10.0884198Z         x1 = x[:, D:]
2025-05-07T20:32:10.0884714Z 
2025-05-07T20:32:10.0884919Z         if contiguous:
2025-05-07T20:32:10.0885156Z             x0 = x0.contiguous()
2025-05-07T20:32:10.0885430Z             x1 = x1.contiguous()
2025-05-07T20:32:10.0893289Z 
2025-05-07T20:32:10.0893516Z         if scale_ub is not None:
2025-05-07T20:32:10.0893823Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:10.0894185Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:10.0894521Z             )
2025-05-07T20:32:10.0894727Z         else:
2025-05-07T20:32:10.0894946Z             scale_ub_tensor = None
2025-05-07T20:32:10.0895212Z 
2025-05-07T20:32:10.0895453Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:10.0895791Z             op = silu_mul_quant
2025-05-07T20:32:10.0896070Z             if compiled:
2025-05-07T20:32:10.0896328Z                 op = torch.compile(op)
2025-05-07T20:32:10.0896645Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:10.0896938Z 
2025-05-07T20:32:10.0897143Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:10.0897327Z 
2025-05-07T20:32:10.0897432Z moe/activation_test.py:117: 
2025-05-07T20:32:10.0897751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:10.0898105Z moe/activation_test.py:115: in fn
2025-05-07T20:32:10.0898408Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:10.0899163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:10.0899920Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:10.0912248Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.0912613Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.0912882Z E       ^
2025-05-07T20:32:10.0913374Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0914403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.0914958Z 
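Every example in this sweep dies at the same point: fp8e4nv is Triton's name for the OCP float8 e4m3 format (torch.float8_e4m3fn), and Triton's NVIDIA backend only lowers it on GPUs with compute capability 8.9 or newer (Ada/Hopper). On older parts such as an A10G (SM 8.6) the backend exposes only fp8e4b15 and fp8e5, so _fbgemm_silu_mul_quant can never compile there, whatever the example parameters. A minimal sketch of a capability guard that would skip the FP8 path on such GPUs; the helper and class names are hypothetical and the guard is not part of activation_test.py:

    import unittest

    import torch

    def has_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) needs compute capability >= 8.9 (Ada/Hopper);
        # an A10G reports (8, 6), so the guard fires on this kind of runner.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):  # hypothetical stand-in for the real test class
        @unittest.skipIf(not has_fp8e4nv(), "fp8e4nv unsupported on this GPU; needs SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # body as listed in the log above

    if __name__ == "__main__":
        unittest.main()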
2025-05-07T20:32:10.0915061Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) (same CompilationError)
2025-05-07T20:32:10.0947596Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:10.2153658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:10.2154258Z     return fn(*args, **kwargs)
2025-05-07T20:32:10.2167706Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.2168072Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.2168340Z E       ^
2025-05-07T20:32:10.2168827Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.2169771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.2170437Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) (same CompilationError)
2025-05-07T20:32:10.6223654Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) (same CompilationError)
2025-05-07T20:32:10.6267400Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) (same CompilationError)
2025-05-07T20:32:10.9048453Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) (same CompilationError)
2025-05-07T20:32:10.9081784Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) (same CompilationError)
2025-05-07T20:32:10.9114668Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) (same CompilationError)
2025-05-07T20:32:11.1064431Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) (same CompilationError)
2025-05-07T20:32:11.1104812Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) (same CompilationError)
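Hypothesis keeps drawing new (T, D, scale_ub, contiguous, compiled) combinations, but the outcome cannot change: the error is raised while the kernel is lowered to TTIR in ast_to_ttir, before any launch, so the input shapes and the torch.compile wrapper never come into play. A standalone sketch that reproduces the same ValueError on a pre-Ada GPU, assuming a recent Triton and a PyTorch build with float8 dtypes (kernel and variable names are illustrative):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8(x_ptr, y_ptr, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        x = tl.load(x_ptr + offs)
        # On SM < 8.9 this cast is rejected during lowering with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + offs, y)

    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8[(1,)](x, y, BLOCK=128)  # raises CompilationError on e.g. an A10G (SM 8.6)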
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.2319406Z 
2025-05-07T20:32:11.2319854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.2320408Z 
2025-05-07T20:32:11.2320518Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:11.2352713Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.4425159Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:11.4458506Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.4460123Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:11.4492051Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.8806082Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:11.8846931Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.8848548Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:12.0092626Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
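Every example above fails the same way, before the kernel body is ever lowered: Triton rejects the fp8e4nv destination type at compile time. fp8e4nv is Triton's name for the E4M3 format (torch.float8_e4m3fn), which Triton generally supports only on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper). The A10G in a linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the error message lists. A minimal sketch of a capability guard for such tests, assuming the kernel quantizes to torch.float8_e4m3fn (supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # E4M3 (Triton fp8e4nv / torch.float8_e4m3fn) needs SM 8.9+;
    # an A10G reports (8, 6) and only offers fp8e4b15 / fp8e5 in Triton.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical use on the failing test class:
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class ActivationTests(unittest.TestCase):
    ...

With a guard like this, the test would be skipped once on SM 8.6 hardware instead of re-raising the identical CompilationError for every sampled (T, D, scale_ub, contiguous, compiled) combination.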
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.0093115Z 2025-05-07T20:32:12.0093572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.0094131Z 2025-05-07T20:32:12.0094235Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.0094671Z self=, 2025-05-07T20:32:12.0095100Z T=2048, 2025-05-07T20:32:12.0095286Z D=5120, 2025-05-07T20:32:12.0095484Z scale_ub=1200.0, 2025-05-07T20:32:12.0095717Z contiguous=False, 2025-05-07T20:32:12.0095942Z compiled=True, 2025-05-07T20:32:12.0096157Z ) 2025-05-07T20:32:12.0096489Z self = 2025-05-07T20:32:12.0097012Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.0097303Z 2025-05-07T20:32:12.0097382Z @given( 2025-05-07T20:32:12.0097618Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.0097961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.0098481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.0098830Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.0099172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.0099481Z ) 2025-05-07T20:32:12.0099837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.0100306Z def test_silu_mul_quant( 2025-05-07T20:32:12.0100556Z self, 2025-05-07T20:32:12.0100746Z T: int, 2025-05-07T20:32:12.0100950Z D: int, 2025-05-07T20:32:12.0101179Z scale_ub: Optional[float], 2025-05-07T20:32:12.0101454Z contiguous: bool, 2025-05-07T20:32:12.0101703Z compiled: bool, 2025-05-07T20:32:12.0101935Z ) -> None: 2025-05-07T20:32:12.0102149Z torch.manual_seed(2025) 2025-05-07T20:32:12.0102404Z 2025-05-07T20:32:12.0102690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.0103043Z 2025-05-07T20:32:12.0103251Z x_sign = torch.sign(x) 2025-05-07T20:32:12.0103551Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.0103874Z x = x_sign * x_clamp 2025-05-07T20:32:12.0104127Z x0 = x[:, :D] 2025-05-07T20:32:12.0104351Z x1 = x[:, D:] 2025-05-07T20:32:12.0104562Z 2025-05-07T20:32:12.0104753Z if contiguous: 2025-05-07T20:32:12.0104991Z x0 = x0.contiguous() 2025-05-07T20:32:12.0105259Z x1 = x1.contiguous() 2025-05-07T20:32:12.0105504Z 2025-05-07T20:32:12.0105705Z if scale_ub is not None: 2025-05-07T20:32:12.0105985Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.0106326Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.0106650Z ) 2025-05-07T20:32:12.0106850Z else: 2025-05-07T20:32:12.0107059Z scale_ub_tensor = None 2025-05-07T20:32:12.0107321Z 2025-05-07T20:32:12.0107562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.0107897Z op = silu_mul_quant 2025-05-07T20:32:12.0108156Z if compiled: 2025-05-07T20:32:12.0108409Z op = torch.compile(op) 2025-05-07T20:32:12.0108802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.0109089Z 2025-05-07T20:32:12.0109288Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.0109460Z 2025-05-07T20:32:12.0109565Z moe/activation_test.py:117: 2025-05-07T20:32:12.0109965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.0110319Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.0110612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.0111198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.0111797Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.0112508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.0113261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.0113827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.0114564Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.0115277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.0115842Z kernel = self.compile( 2025-05-07T20:32:12.0116424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.0117129Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.0117547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.0117788Z 2025-05-07T20:32:12.0118001Z self = 2025-05-07T20:32:12.0119753Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.0121252Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef654310>} 2025-05-07T20:32:12.0122714Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.0123824Z context = 2025-05-07T20:32:12.0124127Z 2025-05-07T20:32:12.0124299Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.0124852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.0125349Z module_map=module_map) 2025-05-07T20:32:12.0125722Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.0126095Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.0126367Z E ^ 2025-05-07T20:32:12.0126862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.0127353Z 2025-05-07T20:32:12.0127803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.0128364Z 2025-05-07T20:32:12.2383174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.2383630Z self=, 2025-05-07T20:32:12.2384171Z T=4096, 2025-05-07T20:32:12.2384434Z D=5120, 2025-05-07T20:32:12.2384625Z scale_ub=1200.0, 2025-05-07T20:32:12.2384850Z contiguous=True, 2025-05-07T20:32:12.2385090Z compiled=True, 2025-05-07T20:32:12.2385292Z ) 2025-05-07T20:32:12.2385617Z self = 2025-05-07T20:32:12.2386431Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.2386723Z 2025-05-07T20:32:12.2386801Z @given( 2025-05-07T20:32:12.2387040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.2387360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.2387668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.2388012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.2388351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.2388650Z ) 2025-05-07T20:32:12.2389007Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.2389525Z def test_silu_mul_quant( 2025-05-07T20:32:12.2389888Z self, 2025-05-07T20:32:12.2390088Z T: int, 2025-05-07T20:32:12.2390287Z D: int, 2025-05-07T20:32:12.2390507Z scale_ub: Optional[float], 2025-05-07T20:32:12.2390778Z contiguous: bool, 2025-05-07T20:32:12.2391034Z compiled: bool, 2025-05-07T20:32:12.2391265Z ) -> None: 2025-05-07T20:32:12.2391480Z torch.manual_seed(2025) 2025-05-07T20:32:12.2391730Z 2025-05-07T20:32:12.2392009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.2392365Z 2025-05-07T20:32:12.2392563Z x_sign = torch.sign(x) 2025-05-07T20:32:12.2392863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.2393176Z x = x_sign * x_clamp 2025-05-07T20:32:12.2393421Z x0 = x[:, :D] 2025-05-07T20:32:12.2393639Z x1 = x[:, D:] 2025-05-07T20:32:12.2393853Z 2025-05-07T20:32:12.2394030Z if contiguous: 2025-05-07T20:32:12.2394262Z x0 = x0.contiguous() 2025-05-07T20:32:12.2394529Z x1 = x1.contiguous() 2025-05-07T20:32:12.2394931Z 2025-05-07T20:32:12.2395129Z if scale_ub is not None: 2025-05-07T20:32:12.2395408Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.2395749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.2396068Z ) 2025-05-07T20:32:12.2396255Z else: 2025-05-07T20:32:12.2396462Z scale_ub_tensor = None 2025-05-07T20:32:12.2396715Z 2025-05-07T20:32:12.2396949Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.2397264Z op = silu_mul_quant 2025-05-07T20:32:12.2397517Z if compiled: 2025-05-07T20:32:12.2397767Z op = torch.compile(op) 2025-05-07T20:32:12.2398068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.2398350Z 2025-05-07T20:32:12.2398546Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.2398714Z 2025-05-07T20:32:12.2398816Z moe/activation_test.py:117: 2025-05-07T20:32:12.2399113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.2399465Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.2399758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.2400343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.2400939Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.2401643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.2402380Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.2402941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.2403669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.2404375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.2404941Z kernel = self.compile( 2025-05-07T20:32:12.2405510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.2406296Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.2406712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.2406953Z 2025-05-07T20:32:12.2407166Z self = 2025-05-07T20:32:12.2408333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.2409850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef73e040>} 2025-05-07T20:32:12.2411317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.2412430Z context = 2025-05-07T20:32:12.2412734Z 2025-05-07T20:32:12.2412903Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.2413456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.2413945Z module_map=module_map) 2025-05-07T20:32:12.2414314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.2414676Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.2414941Z E ^ 2025-05-07T20:32:12.2415431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.2416011Z 2025-05-07T20:32:12.2416458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.2417024Z 2025-05-07T20:32:12.2417132Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.2417561Z self=, 2025-05-07T20:32:12.2417978Z T=128, 2025-05-07T20:32:12.2418165Z D=5120, 2025-05-07T20:32:12.2418359Z scale_ub=1200.0, 2025-05-07T20:32:12.2418579Z contiguous=False, 2025-05-07T20:32:12.2418804Z compiled=True, 2025-05-07T20:32:12.2419007Z ) 2025-05-07T20:32:12.3764520Z self = 2025-05-07T20:32:12.3766032Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.3766813Z 2025-05-07T20:32:12.3767047Z @given( 2025-05-07T20:32:12.3767497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3768164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3768782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3769216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3769561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3769857Z ) 2025-05-07T20:32:12.3770211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3770678Z def test_silu_mul_quant( 2025-05-07T20:32:12.3770927Z self, 2025-05-07T20:32:12.3771123Z T: int, 2025-05-07T20:32:12.3771314Z D: int, 2025-05-07T20:32:12.3771536Z scale_ub: Optional[float], 2025-05-07T20:32:12.3771811Z contiguous: bool, 2025-05-07T20:32:12.3772044Z compiled: bool, 2025-05-07T20:32:12.3772268Z ) -> None: 2025-05-07T20:32:12.3772481Z torch.manual_seed(2025) 2025-05-07T20:32:12.3772718Z 2025-05-07T20:32:12.3772992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3773352Z 2025-05-07T20:32:12.3773539Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3773828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3774433Z x = x_sign * x_clamp 2025-05-07T20:32:12.3774676Z x0 = x[:, :D] 2025-05-07T20:32:12.3774911Z x1 = x[:, D:] 2025-05-07T20:32:12.3775122Z 2025-05-07T20:32:12.3775303Z if contiguous: 2025-05-07T20:32:12.3775538Z x0 = x0.contiguous() 2025-05-07T20:32:12.3775801Z x1 = x1.contiguous() 2025-05-07T20:32:12.3784293Z 2025-05-07T20:32:12.3784544Z if scale_ub is not None: 2025-05-07T20:32:12.3784863Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3785236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3785572Z ) 2025-05-07T20:32:12.3785784Z else: 2025-05-07T20:32:12.3786012Z scale_ub_tensor = None 2025-05-07T20:32:12.3786283Z 2025-05-07T20:32:12.3786546Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3786895Z op = silu_mul_quant 2025-05-07T20:32:12.3787162Z if compiled: 2025-05-07T20:32:12.3787438Z op = torch.compile(op) 2025-05-07T20:32:12.3787758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3788053Z 2025-05-07T20:32:12.3788263Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.3788440Z 2025-05-07T20:32:12.3788554Z moe/activation_test.py:117: 2025-05-07T20:32:12.3788878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3789238Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.3789537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3790239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.3790849Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.3791572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.3792531Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.3793100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3793835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3794548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3795116Z kernel = self.compile( 2025-05-07T20:32:12.3795683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3796388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3796810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3797052Z 2025-05-07T20:32:12.3797273Z self = 2025-05-07T20:32:12.3798450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3799962Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef73eca0>} 2025-05-07T20:32:12.3801429Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3802542Z context = 2025-05-07T20:32:12.3802845Z 2025-05-07T20:32:12.3803020Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3803570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3804057Z module_map=module_map) 2025-05-07T20:32:12.3804557Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3804927Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.3805183Z E ^ 2025-05-07T20:32:12.3805676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3806162Z 2025-05-07T20:32:12.3806615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3807169Z 2025-05-07T20:32:12.3807276Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3807693Z self=, 2025-05-07T20:32:12.3808120Z T=16384, 2025-05-07T20:32:12.3808310Z D=7168, 2025-05-07T20:32:12.3808501Z scale_ub=1200.0, 2025-05-07T20:32:12.3808722Z contiguous=True, 2025-05-07T20:32:12.3808943Z compiled=True, 2025-05-07T20:32:12.3809140Z ) 2025-05-07T20:32:12.3809473Z self = 2025-05-07T20:32:12.3810000Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.3810293Z 2025-05-07T20:32:12.3810376Z @given( 2025-05-07T20:32:12.3810599Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3810921Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3811240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3811573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3811911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3812206Z ) 2025-05-07T20:32:12.3812560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3813025Z def test_silu_mul_quant( 2025-05-07T20:32:12.3813354Z self, 2025-05-07T20:32:12.3813541Z T: int, 2025-05-07T20:32:12.3813740Z D: int, 2025-05-07T20:32:12.3813960Z scale_ub: Optional[float], 2025-05-07T20:32:12.3814234Z contiguous: bool, 2025-05-07T20:32:12.3814475Z compiled: bool, 2025-05-07T20:32:12.3814697Z ) -> None: 2025-05-07T20:32:12.3814916Z torch.manual_seed(2025) 2025-05-07T20:32:12.3815156Z 2025-05-07T20:32:12.3815432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3815788Z 2025-05-07T20:32:12.3815974Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3816271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3816592Z x = x_sign * x_clamp 2025-05-07T20:32:12.3816830Z x0 = x[:, :D] 2025-05-07T20:32:12.3817050Z x1 = x[:, D:] 2025-05-07T20:32:12.3817260Z 2025-05-07T20:32:12.3817441Z if contiguous: 2025-05-07T20:32:12.3817675Z x0 = x0.contiguous() 2025-05-07T20:32:12.3817945Z x1 = x1.contiguous() 2025-05-07T20:32:12.3818185Z 2025-05-07T20:32:12.3818377Z if scale_ub is not None: 2025-05-07T20:32:12.3818661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3818994Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3819313Z ) 2025-05-07T20:32:12.3819510Z else: 2025-05-07T20:32:12.3819712Z scale_ub_tensor = None 2025-05-07T20:32:12.3819965Z 2025-05-07T20:32:12.3820187Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3820512Z op = silu_mul_quant 2025-05-07T20:32:12.3820759Z if compiled: 2025-05-07T20:32:12.3821006Z op = torch.compile(op) 2025-05-07T20:32:12.3821306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3821580Z 2025-05-07T20:32:12.3821766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.3821928Z 2025-05-07T20:32:12.3822030Z moe/activation_test.py:117: 2025-05-07T20:32:12.3822329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3822674Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.3823041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3823633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.3824224Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.3824933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.3825673Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.3826230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3826960Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3827670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3828240Z kernel = self.compile( 2025-05-07T20:32:12.3828807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3829504Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3830061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3830302Z 2025-05-07T20:32:12.3830515Z self = 2025-05-07T20:32:12.3831686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3833193Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef7fea60>} 2025-05-07T20:32:12.3834750Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3835860Z context = 2025-05-07T20:32:12.3836166Z 2025-05-07T20:32:12.3836334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3836879Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3837372Z module_map=module_map) 2025-05-07T20:32:12.3837748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3838109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.3838377Z E ^ 2025-05-07T20:32:12.3838865Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3839358Z 2025-05-07T20:32:12.3839847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3840426Z 2025-05-07T20:32:12.8707097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.8708376Z self=, 2025-05-07T20:32:12.8709374Z T=16384, 2025-05-07T20:32:12.8709606Z D=5120, 2025-05-07T20:32:12.8709916Z scale_ub=1200.0, 2025-05-07T20:32:12.8710137Z contiguous=True, 2025-05-07T20:32:12.8710369Z compiled=False, 2025-05-07T20:32:12.8710586Z ) 2025-05-07T20:32:12.8711010Z self = 2025-05-07T20:32:12.8711740Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.8712062Z 2025-05-07T20:32:12.8712186Z @given( 2025-05-07T20:32:12.8712508Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.8712983Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.8713413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.8714016Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.8714357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.8714653Z ) 2025-05-07T20:32:12.8715015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.8715479Z def test_silu_mul_quant( 2025-05-07T20:32:12.8715725Z self, 2025-05-07T20:32:12.8715917Z T: int, 2025-05-07T20:32:12.8716109Z D: int, 2025-05-07T20:32:12.8716331Z scale_ub: Optional[float], 2025-05-07T20:32:12.8716641Z contiguous: bool, 2025-05-07T20:32:12.8716885Z compiled: bool, 2025-05-07T20:32:12.8717112Z ) -> None: 2025-05-07T20:32:12.8717320Z torch.manual_seed(2025) 2025-05-07T20:32:12.8717567Z 2025-05-07T20:32:12.8717843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.8718194Z 2025-05-07T20:32:12.8718389Z x_sign = torch.sign(x) 2025-05-07T20:32:12.8718692Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.8719004Z x = x_sign * x_clamp 2025-05-07T20:32:12.8719250Z x0 = x[:, :D] 2025-05-07T20:32:12.8719469Z x1 = x[:, D:] 2025-05-07T20:32:12.8719674Z 2025-05-07T20:32:12.8719859Z if contiguous: 2025-05-07T20:32:12.8720092Z x0 = x0.contiguous() 2025-05-07T20:32:12.8720355Z x1 = x1.contiguous() 2025-05-07T20:32:12.8720601Z 2025-05-07T20:32:12.8720794Z if scale_ub is not None: 2025-05-07T20:32:12.8721073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.8721414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.8721732Z ) 2025-05-07T20:32:12.8721931Z else: 2025-05-07T20:32:12.8722143Z scale_ub_tensor = None 2025-05-07T20:32:12.8722533Z 2025-05-07T20:32:12.8722765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.8723083Z op = silu_mul_quant 2025-05-07T20:32:12.8723337Z if compiled: 2025-05-07T20:32:12.8723596Z op = torch.compile(op) 2025-05-07T20:32:12.8723894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.8724179Z 2025-05-07T20:32:12.8724370Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.8724537Z 2025-05-07T20:32:12.8724644Z moe/activation_test.py:117: 2025-05-07T20:32:12.8724941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.8725292Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.8725585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.8726319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.8727064Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.8727640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.8728378Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.8729086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.8729653Z kernel = self.compile( 2025-05-07T20:32:12.8730224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.8730914Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.8731326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.8731572Z 2025-05-07T20:32:12.8731784Z self = 2025-05-07T20:32:12.8732953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.8734563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef6ec550>} 2025-05-07T20:32:12.8736026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.8737131Z context = 2025-05-07T20:32:12.8737433Z 2025-05-07T20:32:12.8737607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.8738155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.8738641Z module_map=module_map) 2025-05-07T20:32:12.8739022Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.8739386Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.8739645Z E ^ 2025-05-07T20:32:12.8740140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.8740630Z 2025-05-07T20:32:12.8741082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.8741633Z 2025-05-07T20:32:12.8741739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.8742158Z self=, 2025-05-07T20:32:12.8742581Z T=1, 2025-05-07T20:32:12.8742763Z D=7168, 2025-05-07T20:32:12.8742948Z scale_ub=1200.0, 2025-05-07T20:32:12.8743178Z contiguous=False, 2025-05-07T20:32:12.8743409Z compiled=False, 2025-05-07T20:32:12.8743694Z ) 2025-05-07T20:32:12.8744022Z self = 2025-05-07T20:32:12.8744535Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.8744819Z 2025-05-07T20:32:12.8744898Z @given( 2025-05-07T20:32:12.8745130Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.8745456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.8745771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.8746105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.8746442Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.8746735Z ) 2025-05-07T20:32:12.8747091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.8747557Z def test_silu_mul_quant( 2025-05-07T20:32:12.8747803Z self, 2025-05-07T20:32:12.8747988Z T: int, 2025-05-07T20:32:12.8748186Z D: int, 2025-05-07T20:32:12.8748415Z scale_ub: Optional[float], 2025-05-07T20:32:12.8748682Z contiguous: bool, 2025-05-07T20:32:12.8748926Z compiled: bool, 2025-05-07T20:32:12.8749149Z ) -> None: 2025-05-07T20:32:12.8749367Z torch.manual_seed(2025) 2025-05-07T20:32:12.8749611Z 2025-05-07T20:32:12.8749977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.8750336Z 2025-05-07T20:32:12.8750524Z x_sign = torch.sign(x) 2025-05-07T20:32:12.8750818Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.8751135Z x = x_sign * x_clamp 2025-05-07T20:32:12.8751374Z x0 = x[:, :D] 2025-05-07T20:32:12.8751589Z x1 = x[:, D:] 2025-05-07T20:32:12.8751796Z 2025-05-07T20:32:12.8751968Z if contiguous: 2025-05-07T20:32:12.8752201Z x0 = x0.contiguous() 2025-05-07T20:32:12.8752464Z x1 = x1.contiguous() 2025-05-07T20:32:12.8752704Z 2025-05-07T20:32:12.8752896Z if scale_ub is not None: 2025-05-07T20:32:12.8753176Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.8753511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.8753829Z ) 2025-05-07T20:32:12.8754106Z else: 2025-05-07T20:32:12.8754312Z scale_ub_tensor = None 2025-05-07T20:32:12.8754572Z 2025-05-07T20:32:12.8754800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.8755125Z op = silu_mul_quant 2025-05-07T20:32:12.8755371Z if compiled: 2025-05-07T20:32:12.8755617Z op = torch.compile(op) 2025-05-07T20:32:12.8755920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.8756197Z 2025-05-07T20:32:12.8756388Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.8756554Z 2025-05-07T20:32:12.8756654Z moe/activation_test.py:117: 2025-05-07T20:32:12.8756951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.8757298Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.8757587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.8758324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.8759072Z 
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same test body and CompilationError traceback as above; compiled=True examples additionally pass through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn ...]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (32.44 MiB free) ... same advice as above ...
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (144.44 MiB free) ...
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (32.44 MiB free) ...
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (32.44 MiB free) ...
moe/activation_test.py:94: OutOfMemoryError
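The OutOfMemoryError run above is a second, independent failure mode: each Hypothesis example materializes x of shape [T, 2 * D] in bfloat16 (T=16384 with D=7168 is 16384 * 14336 * 2 bytes = 448 MiB, matching the failed allocation), and across many examples the 22.07 GiB device fills until even a 56 MiB request fails. A hedged sketch of per-example cleanup that could be called at the end of the test body; the helper name is illustrative, not from the test file:

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        # Drop dead Python references first so their CUDA blocks can be freed,
        # then hand the allocator's cached blocks back to the driver.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

The error text's own suggestion, setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before the process starts, targets fragmentation rather than total footprint, so both measures may be needed.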
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... same CompilationError traceback as above ...]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (30.44 MiB free) ...
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... same CompilationError ...]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (30.44 MiB free) ...
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (30.44 MiB free) ...
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (30.44 MiB free) ...
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7027663Z 2025-05-07T20:32:13.7027785Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7028005Z 2025-05-07T20:32:13.7028108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7028533Z self=, 2025-05-07T20:32:13.7028956Z T=2048, 2025-05-07T20:32:13.7029135Z D=5120, 2025-05-07T20:32:13.7029324Z scale_ub=None, 2025-05-07T20:32:13.7029538Z contiguous=False, 2025-05-07T20:32:13.7029893Z compiled=False, 2025-05-07T20:32:13.7030101Z ) 2025-05-07T20:32:13.7030421Z self = 2025-05-07T20:32:13.7030932Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:13.7031231Z 2025-05-07T20:32:13.7031305Z @given( 2025-05-07T20:32:13.7031532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7031846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7032154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7032496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7032828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7033110Z ) 2025-05-07T20:32:13.7033465Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7033927Z def test_silu_mul_quant( 2025-05-07T20:32:13.7034163Z self, 2025-05-07T20:32:13.7034352Z T: int, 2025-05-07T20:32:13.7034547Z D: int, 2025-05-07T20:32:13.7034769Z scale_ub: Optional[float], 2025-05-07T20:32:13.7035036Z contiguous: bool, 2025-05-07T20:32:13.7035279Z compiled: bool, 2025-05-07T20:32:13.7035495Z ) -> None: 2025-05-07T20:32:13.7035835Z torch.manual_seed(2025) 2025-05-07T20:32:13.7036078Z 2025-05-07T20:32:13.7036349Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7038586Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7040640Z 2025-05-07T20:32:13.7040756Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7040978Z 2025-05-07T20:32:13.7041078Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7041498Z self=, 2025-05-07T20:32:13.7041917Z T=4096, 2025-05-07T20:32:13.7042091Z D=7168, 2025-05-07T20:32:13.7042276Z scale_ub=None, 2025-05-07T20:32:13.7042487Z contiguous=True, 2025-05-07T20:32:13.7042695Z compiled=True, 2025-05-07T20:32:13.7042891Z ) 2025-05-07T20:32:13.7043211Z self = 2025-05-07T20:32:13.7043715Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:13.7044002Z 2025-05-07T20:32:13.7044075Z @given( 2025-05-07T20:32:13.7044300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7044798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7045135Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7045504Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7045874Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7046194Z ) 2025-05-07T20:32:13.7046590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7047110Z def test_silu_mul_quant( 2025-05-07T20:32:13.7047365Z self, 2025-05-07T20:32:13.7047649Z T: int, 2025-05-07T20:32:13.7047854Z D: int, 2025-05-07T20:32:13.7048079Z scale_ub: Optional[float], 2025-05-07T20:32:13.7048372Z contiguous: bool, 2025-05-07T20:32:13.7048629Z compiled: bool, 2025-05-07T20:32:13.7048856Z ) -> None: 2025-05-07T20:32:13.7049076Z torch.manual_seed(2025) 2025-05-07T20:32:13.7049334Z 2025-05-07T20:32:13.7049619Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7052289Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7054712Z 2025-05-07T20:32:13.7054841Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7055082Z 2025-05-07T20:32:13.7055189Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7055662Z self=, 2025-05-07T20:32:13.7056124Z T=2048, 2025-05-07T20:32:13.7056311Z D=5120, 2025-05-07T20:32:13.7056508Z scale_ub=1200.0, 2025-05-07T20:32:13.7056744Z contiguous=False, 2025-05-07T20:32:13.7056976Z compiled=False, 2025-05-07T20:32:13.7057190Z ) 2025-05-07T20:32:13.7057541Z self = 2025-05-07T20:32:13.7058115Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:13.7058534Z 2025-05-07T20:32:13.7058613Z @given( 2025-05-07T20:32:13.7058850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7059200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7059532Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7059910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7060284Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7060596Z ) 2025-05-07T20:32:13.7060992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7061517Z def test_silu_mul_quant( 2025-05-07T20:32:13.7061768Z self, 2025-05-07T20:32:13.7061970Z T: int, 2025-05-07T20:32:13.7062176Z D: int, 2025-05-07T20:32:13.7062398Z scale_ub: Optional[float], 2025-05-07T20:32:13.7062694Z contiguous: bool, 2025-05-07T20:32:13.7062958Z compiled: bool, 2025-05-07T20:32:13.7063201Z ) -> None: 2025-05-07T20:32:13.7063421Z torch.manual_seed(2025) 2025-05-07T20:32:13.7063683Z 2025-05-07T20:32:13.7063980Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7066608Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7069023Z 2025-05-07T20:32:13.7069148Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7069395Z 2025-05-07T20:32:13.7069504Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7070037Z self=, 2025-05-07T20:32:13.7070496Z T=4096, 2025-05-07T20:32:13.7070685Z D=7168, 2025-05-07T20:32:13.7070965Z scale_ub=1200.0, 2025-05-07T20:32:13.7071202Z contiguous=True, 2025-05-07T20:32:13.7071431Z compiled=False, 2025-05-07T20:32:13.7071648Z ) 2025-05-07T20:32:13.7072001Z self = 2025-05-07T20:32:13.7072570Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:13.7072896Z 2025-05-07T20:32:13.7072973Z @given( 2025-05-07T20:32:13.7073212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7073556Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7073895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7074261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7074631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7074943Z ) 2025-05-07T20:32:13.7075335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7075857Z def test_silu_mul_quant( 2025-05-07T20:32:13.7076112Z self, 2025-05-07T20:32:13.7076310Z T: int, 2025-05-07T20:32:13.7076520Z D: int, 2025-05-07T20:32:13.7076742Z scale_ub: Optional[float], 2025-05-07T20:32:13.7077037Z contiguous: bool, 2025-05-07T20:32:13.7077291Z compiled: bool, 2025-05-07T20:32:13.7077522Z ) -> None: 2025-05-07T20:32:13.7077746Z torch.manual_seed(2025) 2025-05-07T20:32:13.7078004Z 2025-05-07T20:32:13.7078290Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7080931Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.7083328Z 2025-05-07T20:32:13.7083450Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.7083678Z 2025-05-07T20:32:13.7083777Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7084203Z self=, 2025-05-07T20:32:13.7084718Z T=16384, 2025-05-07T20:32:13.7084919Z D=7168, 2025-05-07T20:32:13.7085101Z scale_ub=None, 2025-05-07T20:32:13.7085306Z contiguous=False, 2025-05-07T20:32:13.7085528Z compiled=True, 2025-05-07T20:32:13.7085726Z ) 2025-05-07T20:32:13.8371260Z self = 2025-05-07T20:32:13.8372060Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.8372468Z 2025-05-07T20:32:13.8372562Z @given( 2025-05-07T20:32:13.8372793Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8373115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8373426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8373753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8374088Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8374381Z ) 2025-05-07T20:32:13.8374736Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8375202Z def test_silu_mul_quant( 2025-05-07T20:32:13.8375446Z self, 2025-05-07T20:32:13.8375634Z T: int, 2025-05-07T20:32:13.8375836Z D: int, 2025-05-07T20:32:13.8376055Z scale_ub: Optional[float], 2025-05-07T20:32:13.8376337Z contiguous: bool, 2025-05-07T20:32:13.8376574Z compiled: bool, 2025-05-07T20:32:13.8376799Z ) -> None: 2025-05-07T20:32:13.8377007Z torch.manual_seed(2025) 2025-05-07T20:32:13.8377435Z 2025-05-07T20:32:13.8377712Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8379968Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
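The requested sizes line up with the input tensor the test allocates: x = torch.randn([T, 2 * D], dtype=torch.bfloat16) takes T * 2D * 2 bytes, and the larger failures match that arithmetic exactly. A standalone check, using values taken from the log above:

T, D = 16384, 7168
bytes_needed = T * (2 * D) * 2      # bfloat16 is 2 bytes per element
print(bytes_needed / 2**20)         # 448.0, matching "Tried to allocate 448.00 MiB"
# The same arithmetic gives 112 MiB for T=4096, D=7168 and 40 MiB for
# T=2048, D=5120. The 20 MiB requests for T=128 are larger than the tensor
# itself; allocator block rounding is a plausible but unverified explanation.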
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.8382032Z 2025-05-07T20:32:13.8382154Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.8382374Z 2025-05-07T20:32:13.8382474Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8383203Z self=, 2025-05-07T20:32:13.8383629Z T=4096, 2025-05-07T20:32:13.8383820Z D=7168, 2025-05-07T20:32:13.8384002Z scale_ub=None, 2025-05-07T20:32:13.8384212Z contiguous=True, 2025-05-07T20:32:13.8384427Z compiled=False, 2025-05-07T20:32:13.8384624Z ) 2025-05-07T20:32:13.8384941Z self = 2025-05-07T20:32:13.8385455Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:13.8385739Z 2025-05-07T20:32:13.8385816Z @given( 2025-05-07T20:32:13.8386040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8386363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8386665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8387125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8387459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8387750Z ) 2025-05-07T20:32:13.8388106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8388567Z def test_silu_mul_quant( 2025-05-07T20:32:13.8388813Z self, 2025-05-07T20:32:13.8389000Z T: int, 2025-05-07T20:32:13.8389195Z D: int, 2025-05-07T20:32:13.8389411Z scale_ub: Optional[float], 2025-05-07T20:32:13.8389815Z contiguous: bool, 2025-05-07T20:32:13.8390057Z compiled: bool, 2025-05-07T20:32:13.8390277Z ) -> None: 2025-05-07T20:32:13.8390486Z torch.manual_seed(2025) 2025-05-07T20:32:13.8390730Z 2025-05-07T20:32:13.8391000Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8393248Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.8395308Z 2025-05-07T20:32:13.8395427Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.8395645Z 2025-05-07T20:32:13.8395749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8396169Z self=, 2025-05-07T20:32:13.8396582Z T=16384, 2025-05-07T20:32:13.8396766Z D=7168, 2025-05-07T20:32:13.8396954Z scale_ub=None, 2025-05-07T20:32:13.8397164Z contiguous=True, 2025-05-07T20:32:13.8397380Z compiled=False, 2025-05-07T20:32:13.8397587Z ) 2025-05-07T20:32:13.8397909Z self = 2025-05-07T20:32:13.8398417Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:13.8398835Z 2025-05-07T20:32:13.8398913Z @given( 2025-05-07T20:32:13.8399136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8399459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8399770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8400112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8400451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8400741Z ) 2025-05-07T20:32:13.8401108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8401575Z def test_silu_mul_quant( 2025-05-07T20:32:13.8401820Z self, 2025-05-07T20:32:13.8402020Z T: int, 2025-05-07T20:32:13.8402221Z D: int, 2025-05-07T20:32:13.8402444Z scale_ub: Optional[float], 2025-05-07T20:32:13.8402720Z contiguous: bool, 2025-05-07T20:32:13.8402967Z compiled: bool, 2025-05-07T20:32:13.8403191Z ) -> None: 2025-05-07T20:32:13.8403412Z torch.manual_seed(2025) 2025-05-07T20:32:13.8403654Z 2025-05-07T20:32:13.8403928Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8406165Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.8408309Z 2025-05-07T20:32:13.8408425Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.8408646Z 2025-05-07T20:32:13.8408748Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8409173Z self=, 2025-05-07T20:32:13.8409595Z T=16384, 2025-05-07T20:32:13.8409786Z D=7168, 2025-05-07T20:32:13.8409972Z scale_ub=1200.0, 2025-05-07T20:32:13.8410192Z contiguous=True, 2025-05-07T20:32:13.8410406Z compiled=False, 2025-05-07T20:32:13.8410608Z ) 2025-05-07T20:32:13.8410930Z self = 2025-05-07T20:32:13.8411442Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:13.8411738Z 2025-05-07T20:32:13.8411812Z @given( 2025-05-07T20:32:13.8412041Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8412349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8412669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8413000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8413337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8413627Z ) 2025-05-07T20:32:13.8413983Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8414443Z def test_silu_mul_quant( 2025-05-07T20:32:13.8414684Z self, 2025-05-07T20:32:13.8414875Z T: int, 2025-05-07T20:32:13.8415066Z D: int, 2025-05-07T20:32:13.8415275Z scale_ub: Optional[float], 2025-05-07T20:32:13.8415549Z contiguous: bool, 2025-05-07T20:32:13.8415785Z compiled: bool, 2025-05-07T20:32:13.8416001Z ) -> None: 2025-05-07T20:32:13.8416212Z torch.manual_seed(2025) 2025-05-07T20:32:13.8416458Z 2025-05-07T20:32:13.8416720Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8419087Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.8421152Z 2025-05-07T20:32:13.8421268Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.8421491Z 2025-05-07T20:32:13.8421591Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8422012Z self=, 2025-05-07T20:32:13.8422423Z T=128, 2025-05-07T20:32:13.8422606Z D=5120, 2025-05-07T20:32:13.8422793Z scale_ub=1200.0, 2025-05-07T20:32:13.8423013Z contiguous=False, 2025-05-07T20:32:13.8423235Z compiled=False, 2025-05-07T20:32:13.8423437Z ) 2025-05-07T20:32:14.2249861Z self = 2025-05-07T20:32:14.2251440Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2252232Z 2025-05-07T20:32:14.2252441Z @given( 2025-05-07T20:32:14.2253029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2253665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2254276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2254945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2255607Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2256181Z ) 2025-05-07T20:32:14.2256895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2257819Z def test_silu_mul_quant( 2025-05-07T20:32:14.2258289Z self, 2025-05-07T20:32:14.2259011Z T: int, 2025-05-07T20:32:14.2259397Z D: int, 2025-05-07T20:32:14.2259660Z scale_ub: Optional[float], 2025-05-07T20:32:14.2259932Z contiguous: bool, 2025-05-07T20:32:14.2260181Z compiled: bool, 2025-05-07T20:32:14.2260407Z ) -> None: 2025-05-07T20:32:14.2260614Z torch.manual_seed(2025) 2025-05-07T20:32:14.2260858Z 2025-05-07T20:32:14.2261131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2261481Z 2025-05-07T20:32:14.2261674Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2261970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2262280Z x = x_sign * x_clamp 2025-05-07T20:32:14.2262524Z x0 = x[:, :D] 2025-05-07T20:32:14.2262740Z x1 = x[:, D:] 2025-05-07T20:32:14.2262947Z 2025-05-07T20:32:14.2263126Z if contiguous: 2025-05-07T20:32:14.2263359Z x0 = x0.contiguous() 2025-05-07T20:32:14.2263620Z x1 = x1.contiguous() 2025-05-07T20:32:14.2263861Z 2025-05-07T20:32:14.2264051Z if scale_ub is not None: 2025-05-07T20:32:14.2264321Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2264667Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2264983Z ) 2025-05-07T20:32:14.2265173Z else: 2025-05-07T20:32:14.2265377Z scale_ub_tensor = None 2025-05-07T20:32:14.2265626Z 2025-05-07T20:32:14.2265860Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2266175Z op = silu_mul_quant 2025-05-07T20:32:14.2266426Z if compiled: 2025-05-07T20:32:14.2266669Z op = torch.compile(op) 2025-05-07T20:32:14.2266965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2267246Z 2025-05-07T20:32:14.2267437Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2267602Z 2025-05-07T20:32:14.2267699Z moe/activation_test.py:117: 2025-05-07T20:32:14.2268009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2268355Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2268639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2269493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2270430Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2270996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2271724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2272420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2272984Z kernel = self.compile( 2025-05-07T20:32:14.2273556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2274254Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2274670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2282115Z 2025-05-07T20:32:14.2282356Z self = 2025-05-07T20:32:14.2283730Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2285241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ceefeb670>} 2025-05-07T20:32:14.2286718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2287992Z context = 2025-05-07T20:32:14.2288299Z 2025-05-07T20:32:14.2288481Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2289029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2289515Z module_map=module_map) 2025-05-07T20:32:14.2289889Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2290255Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2290520Z E ^ 2025-05-07T20:32:14.2291009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2291495Z 2025-05-07T20:32:14.2291945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2292506Z 2025-05-07T20:32:14.2292614Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2293029Z self=, 2025-05-07T20:32:14.2293448Z T=2048, 2025-05-07T20:32:14.2293638Z D=7168, 2025-05-07T20:32:14.2293820Z scale_ub=None, 2025-05-07T20:32:14.2294035Z contiguous=False, 2025-05-07T20:32:14.2294257Z compiled=False, 2025-05-07T20:32:14.2294459Z ) 2025-05-07T20:32:14.2294783Z self = 2025-05-07T20:32:14.2295297Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.2295585Z 2025-05-07T20:32:14.2295665Z @given( 2025-05-07T20:32:14.2295886Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2296201Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2296512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2296841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2297185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2297474Z ) 2025-05-07T20:32:14.2297944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2298409Z def test_silu_mul_quant( 2025-05-07T20:32:14.2298646Z self, 2025-05-07T20:32:14.2298835Z T: int, 2025-05-07T20:32:14.2299027Z D: int, 2025-05-07T20:32:14.2299237Z scale_ub: Optional[float], 2025-05-07T20:32:14.2299507Z contiguous: bool, 2025-05-07T20:32:14.2299742Z compiled: bool, 2025-05-07T20:32:14.2299955Z ) -> None: 2025-05-07T20:32:14.2300169Z torch.manual_seed(2025) 2025-05-07T20:32:14.2300410Z 2025-05-07T20:32:14.2300688Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2302931Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
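The allocator hint repeated in every OOM message has to be in place before CUDA initializes; below is a minimal sketch of applying it, where the placement in conftest.py is an assumption and the variable name and value come straight from the log:

# Top of conftest.py (or exported in the shell that launches pytest):
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
import torch  # imported after the env var so the allocator picks it up

Equivalently from the shell: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py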
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2304981Z 2025-05-07T20:32:14.2305096Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2305318Z 2025-05-07T20:32:14.2305418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2305840Z self=, 2025-05-07T20:32:14.2306252Z T=128, 2025-05-07T20:32:14.2306429Z D=7168, 2025-05-07T20:32:14.2306616Z scale_ub=1200.0, 2025-05-07T20:32:14.2306827Z contiguous=True, 2025-05-07T20:32:14.2307045Z compiled=True, 2025-05-07T20:32:14.2307243Z ) 2025-05-07T20:32:14.2749439Z self = 2025-05-07T20:32:14.2750356Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.2750759Z 2025-05-07T20:32:14.2750882Z @given( 2025-05-07T20:32:14.2751195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2751611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2751935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2752289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2752627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2752919Z ) 2025-05-07T20:32:14.2753284Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2753749Z def test_silu_mul_quant( 2025-05-07T20:32:14.2753996Z self, 2025-05-07T20:32:14.2754192Z T: int, 2025-05-07T20:32:14.2754393Z D: int, 2025-05-07T20:32:14.2754611Z scale_ub: Optional[float], 2025-05-07T20:32:14.2754899Z contiguous: bool, 2025-05-07T20:32:14.2755146Z compiled: bool, 2025-05-07T20:32:14.2755381Z ) -> None: 2025-05-07T20:32:14.2755603Z torch.manual_seed(2025) 2025-05-07T20:32:14.2755860Z 2025-05-07T20:32:14.2756134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2756489Z 2025-05-07T20:32:14.2756685Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2756982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2757298Z x = x_sign * x_clamp 2025-05-07T20:32:14.2757544Z x0 = x[:, :D] 2025-05-07T20:32:14.2757763Z x1 = x[:, D:] 2025-05-07T20:32:14.2757972Z 2025-05-07T20:32:14.2758160Z if contiguous: 2025-05-07T20:32:14.2758397Z x0 = x0.contiguous() 2025-05-07T20:32:14.2758659Z x1 = x1.contiguous() 2025-05-07T20:32:14.2758909Z 2025-05-07T20:32:14.2759102Z if scale_ub is not None: 2025-05-07T20:32:14.2759389Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2759733Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2760062Z ) 2025-05-07T20:32:14.2760473Z else: 2025-05-07T20:32:14.2760686Z scale_ub_tensor = None 2025-05-07T20:32:14.2760939Z 2025-05-07T20:32:14.2761169Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2761491Z op = silu_mul_quant 2025-05-07T20:32:14.2761744Z if compiled: 2025-05-07T20:32:14.2761995Z op = torch.compile(op) 2025-05-07T20:32:14.2762292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2762582Z 2025-05-07T20:32:14.2762773Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2762938Z 2025-05-07T20:32:14.2763035Z moe/activation_test.py:117: 2025-05-07T20:32:14.2763333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2763679Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2763960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2764552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2765152Z return fn(*args, **kwargs) 2025-05-07T20:32:14.2765862Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2766600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2767166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2767891Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2768597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2769162Z kernel = self.compile( 2025-05-07T20:32:14.2769735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2770559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2770973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2771220Z 2025-05-07T20:32:14.2771432Z self = 2025-05-07T20:32:14.2772605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2774110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ceefda5e0>} 2025-05-07T20:32:14.2775578Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2776687Z context = 2025-05-07T20:32:14.2776999Z 2025-05-07T20:32:14.2777166Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2777717Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2778203Z module_map=module_map) 2025-05-07T20:32:14.2778573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2778934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2779200Z E ^ 2025-05-07T20:32:14.2779686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2780177Z 2025-05-07T20:32:14.2780624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2781189Z 2025-05-07T20:32:14.2781289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2781799Z self=, 2025-05-07T20:32:14.2782217Z T=128, 2025-05-07T20:32:14.2782411Z D=7168, 2025-05-07T20:32:14.2782602Z scale_ub=1200.0, 2025-05-07T20:32:14.2782999Z contiguous=True, 2025-05-07T20:32:14.2783224Z compiled=False, 2025-05-07T20:32:14.2783431Z ) 2025-05-07T20:32:14.2783750Z self = 2025-05-07T20:32:14.2784268Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.2784553Z 2025-05-07T20:32:14.2784633Z @given( 2025-05-07T20:32:14.2784865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2785179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2785493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2785834Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2786166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2786473Z ) 2025-05-07T20:32:14.2786838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2787300Z def test_silu_mul_quant( 2025-05-07T20:32:14.2787538Z self, 2025-05-07T20:32:14.2787731Z T: int, 2025-05-07T20:32:14.2787928Z D: int, 2025-05-07T20:32:14.2788140Z scale_ub: Optional[float], 2025-05-07T20:32:14.2788414Z contiguous: bool, 2025-05-07T20:32:14.2788655Z compiled: bool, 2025-05-07T20:32:14.2788869Z ) -> None: 2025-05-07T20:32:14.2789084Z torch.manual_seed(2025) 2025-05-07T20:32:14.2789335Z 2025-05-07T20:32:14.2789637Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2790100Z 2025-05-07T20:32:14.2790296Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2790712Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2792919Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
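The CompilationError here is an architecture limit rather than a code bug: Triton's fp8e4nv corresponds to float8_e4m3fn and compiles only on compute capability 8.9 or newer (Ada/Hopper), while the A10G behind this linux.g5.4xlarge runner is SM 8.6. A hedged sketch of a capability guard follows; the helper name and skip wiring are illustrative, not from activation_test.py:

import unittest
import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs SM 8.9+; the A10G on this runner is SM 8.6.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class Fp8ActivationTests(unittest.TestCase):
    ...  # fp8 quantization cases would go here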
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2794972Z 2025-05-07T20:32:14.2795090Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:14.2795312Z 2025-05-07T20:32:14.2795413Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2795834Z self=, 2025-05-07T20:32:14.2796249Z T=128, 2025-05-07T20:32:14.2796436Z D=5120, 2025-05-07T20:32:14.2796629Z scale_ub=1200.0, 2025-05-07T20:32:14.2796844Z contiguous=True, 2025-05-07T20:32:14.2797063Z compiled=True, 2025-05-07T20:32:14.2797263Z ) 2025-05-07T20:32:14.2797585Z self = 2025-05-07T20:32:14.2798097Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.2798384Z 2025-05-07T20:32:14.2798462Z @given( 2025-05-07T20:32:14.2798693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2799010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2799335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2799678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2800011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2800311Z ) 2025-05-07T20:32:14.2800675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2801143Z def test_silu_mul_quant( 2025-05-07T20:32:14.2801400Z self, 2025-05-07T20:32:14.2801595Z T: int, 2025-05-07T20:32:14.2801792Z D: int, 2025-05-07T20:32:14.2802158Z scale_ub: Optional[float], 2025-05-07T20:32:14.2802435Z contiguous: bool, 2025-05-07T20:32:14.2802669Z compiled: bool, 2025-05-07T20:32:14.2802890Z ) -> None: 2025-05-07T20:32:14.2803109Z torch.manual_seed(2025) 2025-05-07T20:32:14.2803345Z 2025-05-07T20:32:14.2803620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2803973Z 2025-05-07T20:32:14.2804162Z > x_sign = torch.sign(x) 2025-05-07T20:32:14.2806285Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2808760Z 2025-05-07T20:32:14.2808884Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:14.2809108Z 2025-05-07T20:32:14.2809209Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2809633Z self=, 2025-05-07T20:32:14.2810046Z T=128, 2025-05-07T20:32:14.2810224Z D=7168, 2025-05-07T20:32:14.2810412Z scale_ub=None, 2025-05-07T20:32:14.2810614Z contiguous=True, 2025-05-07T20:32:14.2810832Z compiled=True, 2025-05-07T20:32:14.2811032Z ) 2025-05-07T20:32:14.5856123Z self = 2025-05-07T20:32:14.5857563Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.5858471Z 2025-05-07T20:32:14.5858630Z @given( 2025-05-07T20:32:14.5859089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5859694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.5860050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.5860391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.5860724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.5861017Z ) 2025-05-07T20:32:14.5861379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.5861841Z def test_silu_mul_quant( 2025-05-07T20:32:14.5862079Z self, 2025-05-07T20:32:14.5862272Z T: int, 2025-05-07T20:32:14.5862474Z D: int, 2025-05-07T20:32:14.5862689Z scale_ub: Optional[float], 2025-05-07T20:32:14.5862965Z contiguous: bool, 2025-05-07T20:32:14.5863206Z compiled: bool, 2025-05-07T20:32:14.5863433Z ) -> None: 2025-05-07T20:32:14.5863653Z torch.manual_seed(2025) 2025-05-07T20:32:14.5863898Z 2025-05-07T20:32:14.5864169Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5866407Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
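Note that 21.77 GiB of the 22.07 GiB here is allocated, not merely cached, so torch.cuda.empty_cache() alone cannot recover it; releasing stale references between Hypothesis examples is one mitigation worth trying, sketched below under that assumption rather than as a verified fix:

import gc
import unittest
import torch

class ActivationTestsWithCleanup(unittest.TestCase):  # illustrative name
    def tearDown(self) -> None:
        gc.collect()              # drop tensors that are no longer referenced
        torch.cuda.empty_cache()  # then return cached blocks to the driver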
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.5868449Z 2025-05-07T20:32:14.5868567Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.5868788Z 2025-05-07T20:32:14.5926143Z FAILED 2025-05-07T20:32:14.5926343Z 2025-05-07T20:32:14.5926526Z =================================== FAILURES =================================== 2025-05-07T20:32:14.5927158Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:14.5927968Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:14.5928861Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:14.5929656Z | yield 2025-05-07T20:32:14.5930263Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:14.5930997Z | self._callTestMethod(testMethod) 2025-05-07T20:32:14.5931793Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:14.5932562Z | method() 2025-05-07T20:32:14.5933466Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:14.5934514Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5935422Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:14.5936327Z | raise the_error_hypothesis_found 2025-05-07T20:32:14.5937012Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:14.5937718Z +-+---------------- 1 ---------------- 2025-05-07T20:32:14.5938118Z | Traceback (most recent call last): 2025-05-07T20:32:14.5939138Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:14.5940243Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5943300Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.5946317Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.5946782Z | self=, 2025-05-07T20:32:14.5947215Z | T=128, 2025-05-07T20:32:14.5947417Z | D=7168, 2025-05-07T20:32:14.5947634Z | scale_ub=1200.0, 2025-05-07T20:32:14.5947885Z | contiguous=True, 2025-05-07T20:32:14.5948126Z | compiled=False, 2025-05-07T20:32:14.5948358Z | ) 2025-05-07T20:32:14.5948542Z | 2025-05-07T20:32:14.5949101Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:14.5949903Z +---------------- 2 ---------------- 2025-05-07T20:32:14.5950230Z | Traceback (most recent call last): 2025-05-07T20:32:14.5950994Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:14.5951826Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5954059Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.5956316Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.5956774Z | self=, 2025-05-07T20:32:14.5957199Z | T=128, 2025-05-07T20:32:14.5957392Z | D=7168, 2025-05-07T20:32:14.5957599Z | scale_ub=None, 2025-05-07T20:32:14.5957837Z | contiguous=True, 2025-05-07T20:32:14.5958074Z | compiled=True, 2025-05-07T20:32:14.5958293Z | ) 2025-05-07T20:32:14.5958469Z | 2025-05-07T20:32:14.5959012Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.5959661Z +---------------- 3 ---------------- 2025-05-07T20:32:14.5959953Z | Traceback (most recent call last): 2025-05-07T20:32:14.5960715Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:14.5961546Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5964252Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
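The reproduction recipe Hypothesis prints is literal: stack the quoted decorator, version string and payload unchanged, on top of the failing test. A sketch using the blob from falsifying example 1 above, with the @given block copied from the log (self dropped so the sketch stands alone):

from hypothesis import given, reproduce_failure, strategies as st

@reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
    ...  # body as in moe/activation_test.py

Remove the decorator again once the failure is fixed; it pins the test to this single example.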
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.5966986Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.5967610Z | self=, 2025-05-07T20:32:14.5968313Z | T=128, 2025-05-07T20:32:14.5968591Z | D=5120, 2025-05-07T20:32:14.5968880Z | scale_ub=1200.0, 2025-05-07T20:32:14.5969220Z | contiguous=True, 2025-05-07T20:32:14.5969553Z | compiled=True, 2025-05-07T20:32:14.5969855Z | ) 2025-05-07T20:32:14.5970107Z | 2025-05-07T20:32:14.5970853Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.5971748Z +---------------- 4 ---------------- 2025-05-07T20:32:14.5972154Z | Traceback (most recent call last): 2025-05-07T20:32:14.5973165Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:14.5974183Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.5975128Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:14.5976135Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.5977334Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:14.5978496Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.5979364Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:14.5980409Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.5981464Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:14.5982570Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.5983970Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:14.5985336Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.5986193Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:14.5986941Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6003928Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:14.6004827Z | fn() 2025-05-07T20:32:14.6005697Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:14.6006632Z | self.fn.run( 2025-05-07T20:32:14.6007419Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:14.6008277Z | kernel = self.compile( 2025-05-07T20:32:14.6009156Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:14.6010142Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6011196Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:14.6012411Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6013172Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6013691Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6014080Z | ^ 2025-05-07T20:32:14.6014774Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6015888Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.6016482Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:14.6017235Z | self=, 2025-05-07T20:32:14.6017863Z | T=1, # or any other generated value 2025-05-07T20:32:14.6018316Z | D=5120, # or any other generated value 2025-05-07T20:32:14.6018806Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:14.6019336Z | contiguous=True, # or any other generated value 2025-05-07T20:32:14.6019854Z | compiled=True, # or any other generated value 2025-05-07T20:32:14.6020303Z | ) 2025-05-07T20:32:14.6020569Z | 2025-05-07T20:32:14.6021349Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.6022288Z +------------------------------------ 2025-05-07T20:32:14.6022810Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:14.6023354Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6023957Z self=, 2025-05-07T20:32:14.6024540Z T=1, 2025-05-07T20:32:14.6024804Z D=5120, 2025-05-07T20:32:14.6025072Z scale_ub=None, 2025-05-07T20:32:14.6025381Z contiguous=True, 2025-05-07T20:32:14.6025702Z compiled=True, 2025-05-07T20:32:14.6025996Z ) 2025-05-07T20:32:14.6026461Z self = 2025-05-07T20:32:14.6027175Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6027557Z 2025-05-07T20:32:14.6027671Z @given( 2025-05-07T20:32:14.6028003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6028445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6028852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6029291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6030011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6030429Z ) 2025-05-07T20:32:14.6030904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6031515Z def test_silu_mul_quant( 2025-05-07T20:32:14.6031852Z self, 2025-05-07T20:32:14.6032128Z T: int, 2025-05-07T20:32:14.6032396Z D: int, 2025-05-07T20:32:14.6032680Z scale_ub: Optional[float], 2025-05-07T20:32:14.6033036Z contiguous: bool, 2025-05-07T20:32:14.6033365Z compiled: bool, 2025-05-07T20:32:14.6033684Z ) -> None: 2025-05-07T20:32:14.6033982Z torch.manual_seed(2025) 2025-05-07T20:32:14.6034329Z 2025-05-07T20:32:14.6034714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6035215Z 2025-05-07T20:32:14.6035486Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6035900Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6036339Z x = x_sign * x_clamp 2025-05-07T20:32:14.6036694Z x0 = x[:, :D] 2025-05-07T20:32:14.6037003Z x1 = x[:, D:] 2025-05-07T20:32:14.6037299Z 2025-05-07T20:32:14.6037553Z if contiguous: 2025-05-07T20:32:14.6037863Z x0 = x0.contiguous() 
2025-05-07T20:32:14.6038207Z x1 = x1.contiguous() 2025-05-07T20:32:14.6038540Z 2025-05-07T20:32:14.6038798Z if scale_ub is not None: 2025-05-07T20:32:14.6039174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6039666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6040120Z ) 2025-05-07T20:32:14.6040379Z else: 2025-05-07T20:32:14.6040661Z scale_ub_tensor = None 2025-05-07T20:32:14.6041003Z 2025-05-07T20:32:14.6041312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6041853Z op = silu_mul_quant 2025-05-07T20:32:14.6042194Z if compiled: 2025-05-07T20:32:14.6042550Z op = torch.compile(op) 2025-05-07T20:32:14.6042976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6043382Z 2025-05-07T20:32:14.6043649Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6044047Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6044477Z 2025-05-07T20:32:14.6044806Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6045280Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6045707Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6046162Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6046684Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6047130Z 2025-05-07T20:32:14.6047413Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6047704Z 2025-05-07T20:32:14.6047852Z moe/activation_test.py:126: 2025-05-07T20:32:14.6048268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6048763Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6049235Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6050417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6051516Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6052295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6053299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6054300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6055378Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6056593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6057610Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6058527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6059330Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6060190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6060959Z fn() 2025-05-07T20:32:14.6061646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6062508Z self.fn.run( 2025-05-07T20:32:14.6063202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6063967Z kernel = self.compile( 2025-05-07T20:32:14.6064744Z 
2025-05-07T20:32:14.6064744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:14.6065729Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:14.6066237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:14.6066531Z 
2025-05-07T20:32:14.6066824Z self = 
2025-05-07T20:32:14.6068321Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:14.6070219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f214764c0>}
2025-05-07T20:32:14.6072056Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:14.6073392Z context = 
2025-05-07T20:32:14.6073768Z 
2025-05-07T20:32:14.6073970Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:14.6074644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:14.6075229Z module_map=module_map)
2025-05-07T20:32:14.6075664Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.6076109Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.6076440Z E ^
2025-05-07T20:32:14.6077012Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.6077604Z 
2025-05-07T20:32:14.6078131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
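Every failure in this run has the same root cause: Triton refuses to lower the fp8e4nv (e4m3) element type on this machine's GPU, and both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant request it. A guard along the following lines would skip the fp8 tests on such devices. This is a minimal sketch, assuming fp8e4nv requires compute capability 8.9 or newer; the helper name and skip message are illustrative, not FBGEMM's actual API:

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv (e4m3) only on SM 8.9+ (Ada, Hopper);
        # older parts only get fp8e4b15 and fp8e5, per the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the failing test, e.g.:
    # @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...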
[Hypothesis then tried ten further examples; every one raised the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") from compiler.py:100, so the repeated test source and tracebacks are abridged here to the varied parameters and the first kernel that failed to compile:
  T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
  T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
  T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
  T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
  T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
  T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
  T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True  -> _kernel_quantize_fp8_row
  T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
  T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
  T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> _kernel_quantize_fp8_row
In every compiled=False example the error surfaced directly in fn() at moe/activation_test.py:117, via fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant); in every compiled=True example it surfaced in ref_fn() at moe/activation_test.py:126, via fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370 (triton_quantize_fp8_row).]
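To replay the minimal falsifying example locally, the decorator that Hypothesis itself suggests in the Falsifying-example output above can be applied directly. A sketch; the version string and payload are copied verbatim from the log, and the body and @settings stay exactly as in the listing above:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body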
y_scale_ref = ref_fn() 2025-05-07T20:32:14.6481229Z 2025-05-07T20:32:14.6481333Z moe/activation_test.py:126: 2025-05-07T20:32:14.6481464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6481568Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6481713Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6482321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6482421Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6483171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6483449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6483850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6484119Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6484548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6484821Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6485223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6485400Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6485772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6485845Z fn() 2025-05-07T20:32:14.6486285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6486367Z self.fn.run( 2025-05-07T20:32:14.6486728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6486826Z kernel = self.compile( 2025-05-07T20:32:14.6487234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6487418Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6487548Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6487553Z 2025-05-07T20:32:14.6487765Z self = 2025-05-07T20:32:14.6488774Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6489411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
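For reference while reading these failures: silu_mul_quant fuses y = silu(x0) * x1 = x0 * sigmoid(x0) * x1 with row-wise fp8 quantization, returning the fp8 tensor plus one float32 scale per row, which the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. A minimal eager-mode sketch of that contract (an illustration only: the helper name is hypothetical, scale_ub is assumed to cap the per-row max, and torch.float8_e4m3fn requires PyTorch >= 2.1):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_eager(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in float32, matching the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One dequantization scale per row: row_max / FP8_MAX.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        # Scale into fp8 range; clamp in case scale_ub shrank the scale.
        y_scaled = torch.clamp(y / scale[:, None], -FP8_MAX, FP8_MAX)
        return y_scaled.to(torch.float8_e4m3fn), scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] then recovers y to fp8 precision, which is what the test compares against.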
at 0x7f1ee021f700>} 2025-05-07T20:32:14.6490360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6490575Z context = 2025-05-07T20:32:14.6490580Z 2025-05-07T20:32:14.6490769Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6491083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6491199Z module_map=module_map) 2025-05-07T20:32:14.6491385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6491497Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6491575Z E ^ 2025-05-07T20:32:14.6492014Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6492019Z 2025-05-07T20:32:14.6492525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6492530Z 2025-05-07T20:32:14.6492647Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6492906Z self=, 2025-05-07T20:32:14.6492984Z T=2048, 2025-05-07T20:32:14.6493066Z D=5120, 2025-05-07T20:32:14.6493148Z scale_ub=None, 2025-05-07T20:32:14.6493353Z contiguous=True, 2025-05-07T20:32:14.6493443Z compiled=True, 2025-05-07T20:32:14.6493518Z ) 2025-05-07T20:32:14.6493770Z self = 2025-05-07T20:32:14.6493973Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6493978Z 2025-05-07T20:32:14.6494055Z @given( 2025-05-07T20:32:14.6494189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6494291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6494414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6494545Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6494664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6494740Z ) 2025-05-07T20:32:14.6495040Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6495137Z def test_silu_mul_quant( 2025-05-07T20:32:14.6495219Z self, 2025-05-07T20:32:14.6495302Z T: int, 2025-05-07T20:32:14.6495380Z D: int, 2025-05-07T20:32:14.6495488Z scale_ub: Optional[float], 2025-05-07T20:32:14.6495580Z contiguous: bool, 2025-05-07T20:32:14.6495673Z compiled: bool, 2025-05-07T20:32:14.6495758Z ) -> None: 2025-05-07T20:32:14.6495857Z torch.manual_seed(2025) 2025-05-07T20:32:14.6495931Z 2025-05-07T20:32:14.6496124Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6496198Z 2025-05-07T20:32:14.6496292Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6496437Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6496527Z x = x_sign * x_clamp 2025-05-07T20:32:14.6496608Z x0 = x[:, :D] 2025-05-07T20:32:14.6496693Z x1 = x[:, D:] 2025-05-07T20:32:14.6496766Z 2025-05-07T20:32:14.6496854Z if contiguous: 2025-05-07T20:32:14.6496951Z x0 = x0.contiguous() 2025-05-07T20:32:14.6497042Z x1 = x1.contiguous() 2025-05-07T20:32:14.6497132Z 2025-05-07T20:32:14.6497230Z if scale_ub is not None: 2025-05-07T20:32:14.6497339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6497574Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6497647Z ) 2025-05-07T20:32:14.6497720Z else: 2025-05-07T20:32:14.6497818Z scale_ub_tensor = None 2025-05-07T20:32:14.6497893Z 2025-05-07T20:32:14.6498024Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6498119Z op = silu_mul_quant 2025-05-07T20:32:14.6498202Z if compiled: 
2025-05-07T20:32:14.6498307Z op = torch.compile(op) 2025-05-07T20:32:14.6498412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6498484Z 2025-05-07T20:32:14.6498578Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6498698Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6498775Z 2025-05-07T20:32:14.6498915Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6499016Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6499114Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6499254Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6499394Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6499467Z 2025-05-07T20:32:14.6499575Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6499579Z 2025-05-07T20:32:14.6499676Z moe/activation_test.py:126: 2025-05-07T20:32:14.6499811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6499918Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6500054Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6500670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6500877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6501262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6501504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6501897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6502170Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6502599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6502865Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6503271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6503447Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6503816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6503898Z fn() 2025-05-07T20:32:14.6504331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6504419Z self.fn.run( 2025-05-07T20:32:14.6504778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6504868Z kernel = self.compile( 2025-05-07T20:32:14.6505282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6505461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6505596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6505605Z 2025-05-07T20:32:14.6505816Z self = 2025-05-07T20:32:14.6506769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:14.6507323Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7dcbe790>} 2025-05-07T20:32:14.6508135Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6508337Z context = 2025-05-07T20:32:14.6508341Z 2025-05-07T20:32:14.6508507Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6508793Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6508907Z module_map=module_map) 2025-05-07T20:32:14.6509070Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6509175Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6509252Z E ^ 2025-05-07T20:32:14.6509635Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6509639Z 2025-05-07T20:32:14.6510185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6510190Z 2025-05-07T20:32:14.6510290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6510529Z self=, 2025-05-07T20:32:14.6510606Z T=128, 2025-05-07T20:32:14.6510771Z D=5120, 2025-05-07T20:32:14.6510860Z scale_ub=None, 2025-05-07T20:32:14.6510944Z contiguous=True, 2025-05-07T20:32:14.6511026Z compiled=True, 2025-05-07T20:32:14.6511101Z ) 2025-05-07T20:32:14.6511333Z self = 2025-05-07T20:32:14.6511505Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6511519Z 2025-05-07T20:32:14.6511596Z @given( 2025-05-07T20:32:14.6511715Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6511816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6511928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6512043Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6512158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6512232Z ) 2025-05-07T20:32:14.6512493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6512596Z def test_silu_mul_quant( 2025-05-07T20:32:14.6512674Z self, 2025-05-07T20:32:14.6512750Z T: int, 2025-05-07T20:32:14.6512827Z D: int, 2025-05-07T20:32:14.6512926Z scale_ub: Optional[float], 2025-05-07T20:32:14.6513016Z contiguous: bool, 2025-05-07T20:32:14.6513096Z compiled: bool, 2025-05-07T20:32:14.6513174Z ) -> None: 2025-05-07T20:32:14.6513272Z torch.manual_seed(2025) 2025-05-07T20:32:14.6513341Z 2025-05-07T20:32:14.6513512Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6513585Z 2025-05-07T20:32:14.6513674Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6513794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6513886Z x = x_sign * x_clamp 2025-05-07T20:32:14.6513966Z x0 = x[:, :D] 2025-05-07T20:32:14.6514044Z x1 = x[:, D:] 2025-05-07T20:32:14.6514119Z 2025-05-07T20:32:14.6514201Z if contiguous: 2025-05-07T20:32:14.6514299Z x0 = x0.contiguous() 2025-05-07T20:32:14.6514386Z x1 = x1.contiguous() 2025-05-07T20:32:14.6514460Z 2025-05-07T20:32:14.6514553Z if scale_ub is not None: 2025-05-07T20:32:14.6514743Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6514880Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6514958Z ) 2025-05-07T20:32:14.6515032Z else: 2025-05-07T20:32:14.6515125Z scale_ub_tensor = None 2025-05-07T20:32:14.6515202Z 2025-05-07T20:32:14.6515332Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:14.6515420Z op = silu_mul_quant 2025-05-07T20:32:14.6515509Z if compiled: 2025-05-07T20:32:14.6515607Z op = torch.compile(op) 2025-05-07T20:32:14.6515720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6515789Z 2025-05-07T20:32:14.6515878Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6516008Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6516081Z 2025-05-07T20:32:14.6516217Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6516328Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6516426Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6516549Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6516696Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6516766Z 2025-05-07T20:32:14.6516868Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6516879Z 2025-05-07T20:32:14.6516975Z moe/activation_test.py:126: 2025-05-07T20:32:14.6517106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6517214Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6517348Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6517956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6518222Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6518612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6518850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6519243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6519510Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6519944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6520209Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6520617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6520792Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6521158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6521242Z fn() 2025-05-07T20:32:14.6521672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6521753Z self.fn.run( 2025-05-07T20:32:14.6522114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6522206Z kernel = self.compile( 2025-05-07T20:32:14.6522614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6522797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6522930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6522935Z 2025-05-07T20:32:14.6523150Z self = 2025-05-07T20:32:14.6524083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6524638Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7db5ea60>} 2025-05-07T20:32:14.6525448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6525644Z context = 2025-05-07T20:32:14.6525652Z 2025-05-07T20:32:14.6525827Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6526106Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6526221Z module_map=module_map) 2025-05-07T20:32:14.6526383Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6526484Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6526568Z E ^ 2025-05-07T20:32:14.6526951Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6526956Z 2025-05-07T20:32:14.6527400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6527409Z 2025-05-07T20:32:14.6527508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6527739Z self=, 2025-05-07T20:32:14.6527899Z T=4096, 2025-05-07T20:32:14.6527973Z D=5120, 2025-05-07T20:32:14.6528054Z scale_ub=None, 2025-05-07T20:32:14.6528145Z contiguous=True, 2025-05-07T20:32:14.6528232Z compiled=True, 2025-05-07T20:32:14.6528302Z ) 2025-05-07T20:32:14.6528532Z self = 2025-05-07T20:32:14.6528708Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6528713Z 2025-05-07T20:32:14.6528794Z @given( 2025-05-07T20:32:14.6528911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6529010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6529130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6529245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6529357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6529435Z ) 2025-05-07T20:32:14.6529700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6529793Z def test_silu_mul_quant( 2025-05-07T20:32:14.6529871Z self, 2025-05-07T20:32:14.6529953Z T: int, 2025-05-07T20:32:14.6530026Z D: int, 2025-05-07T20:32:14.6530129Z scale_ub: Optional[float], 2025-05-07T20:32:14.6530216Z contiguous: bool, 2025-05-07T20:32:14.6530301Z compiled: bool, 2025-05-07T20:32:14.6530374Z ) -> None: 2025-05-07T20:32:14.6530466Z torch.manual_seed(2025) 2025-05-07T20:32:14.6530539Z 2025-05-07T20:32:14.6530712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6530783Z 2025-05-07T20:32:14.6530876Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6530999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6531087Z x = x_sign * x_clamp 2025-05-07T20:32:14.6531178Z x0 = x[:, :D] 2025-05-07T20:32:14.6531263Z x1 = x[:, D:] 2025-05-07T20:32:14.6531334Z 2025-05-07T20:32:14.6531420Z if contiguous: 2025-05-07T20:32:14.6531513Z x0 = x0.contiguous() 2025-05-07T20:32:14.6531599Z x1 = x1.contiguous() 2025-05-07T20:32:14.6531756Z 2025-05-07T20:32:14.6531849Z if scale_ub is not None: 2025-05-07T20:32:14.6531958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6532090Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6532163Z ) 2025-05-07T20:32:14.6532243Z else: 2025-05-07T20:32:14.6532334Z scale_ub_tensor 
= None 2025-05-07T20:32:14.6532402Z 2025-05-07T20:32:14.6532534Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6532623Z op = silu_mul_quant 2025-05-07T20:32:14.6532705Z if compiled: 2025-05-07T20:32:14.6532813Z op = torch.compile(op) 2025-05-07T20:32:14.6532917Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6532989Z 2025-05-07T20:32:14.6533083Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6533201Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6533279Z 2025-05-07T20:32:14.6533417Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6533517Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6533621Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6533739Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6533879Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6533956Z 2025-05-07T20:32:14.6534052Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6534057Z 2025-05-07T20:32:14.6534153Z moe/activation_test.py:126: 2025-05-07T20:32:14.6534289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6534391Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6534638Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6535252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6535351Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6535740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6535971Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6536368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6536634Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6537063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6537333Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6537740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6537914Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6538288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6538365Z fn() 2025-05-07T20:32:14.6538797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6538878Z self.fn.run( 2025-05-07T20:32:14.6539238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6539338Z kernel = self.compile( 2025-05-07T20:32:14.6539744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6539930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6540059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6540063Z 2025-05-07T20:32:14.6540356Z self = 2025-05-07T20:32:14.6541212Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6541758Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7d2f1700>} 2025-05-07T20:32:14.6542576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6542772Z context = 2025-05-07T20:32:14.6542777Z 2025-05-07T20:32:14.6542947Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6543226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6543333Z module_map=module_map) 2025-05-07T20:32:14.6543499Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6543598Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6543673Z E ^ 2025-05-07T20:32:14.6544055Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6544060Z 2025-05-07T20:32:14.6544508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6544513Z 2025-05-07T20:32:14.6544701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6544931Z self=, 2025-05-07T20:32:14.6545007Z T=16384, 2025-05-07T20:32:14.6545092Z D=5120, 2025-05-07T20:32:14.6545171Z scale_ub=None, 2025-05-07T20:32:14.6545255Z contiguous=True, 2025-05-07T20:32:14.6545341Z compiled=True, 2025-05-07T20:32:14.6545412Z ) 2025-05-07T20:32:14.6545637Z self = 2025-05-07T20:32:14.6545816Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6545821Z 2025-05-07T20:32:14.6545897Z @given( 2025-05-07T20:32:14.6546016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6546124Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6546239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6546359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6546477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6546548Z ) 2025-05-07T20:32:14.6546808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6546903Z def test_silu_mul_quant( 2025-05-07T20:32:14.6546981Z self, 2025-05-07T20:32:14.6547058Z T: int, 2025-05-07T20:32:14.6547132Z D: int, 2025-05-07T20:32:14.6547227Z scale_ub: Optional[float], 2025-05-07T20:32:14.6547317Z contiguous: bool, 2025-05-07T20:32:14.6547401Z compiled: bool, 2025-05-07T20:32:14.6547480Z ) -> None: 2025-05-07T20:32:14.6547573Z torch.manual_seed(2025) 2025-05-07T20:32:14.6547644Z 2025-05-07T20:32:14.6547823Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6547894Z 2025-05-07T20:32:14.6547984Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6548113Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6548206Z x = x_sign * x_clamp 2025-05-07T20:32:14.6548286Z x0 = x[:, :D] 2025-05-07T20:32:14.6552217Z x1 = x[:, D:] 2025-05-07T20:32:14.6552305Z 2025-05-07T20:32:14.6552400Z if contiguous: 2025-05-07T20:32:14.6552606Z x0 = x0.contiguous() 2025-05-07T20:32:14.6552702Z x1 = x1.contiguous() 2025-05-07T20:32:14.6552775Z 2025-05-07T20:32:14.6552866Z if scale_ub is not None: 2025-05-07T20:32:14.6552978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6553115Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:14.6553187Z ) 2025-05-07T20:32:14.6553266Z else: 2025-05-07T20:32:14.6553360Z scale_ub_tensor = None 2025-05-07T20:32:14.6553429Z 2025-05-07T20:32:14.6553565Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6553657Z op = silu_mul_quant 2025-05-07T20:32:14.6553745Z if compiled: 2025-05-07T20:32:14.6553850Z op = torch.compile(op) 2025-05-07T20:32:14.6553959Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6554039Z 2025-05-07T20:32:14.6554132Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6554258Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6554335Z 2025-05-07T20:32:14.6554475Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6554576Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6554680Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6554803Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6554942Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6555016Z 2025-05-07T20:32:14.6555116Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6555121Z 2025-05-07T20:32:14.6555220Z moe/activation_test.py:126: 2025-05-07T20:32:14.6555355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6555542Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6555682Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6556306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6556407Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6556801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6557034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6557431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6557697Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6558126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6558399Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6558804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6558981Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6559343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6559421Z fn() 2025-05-07T20:32:14.6559855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6559937Z self.fn.run( 2025-05-07T20:32:14.6560293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6560392Z kernel = self.compile( 2025-05-07T20:32:14.6560798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6560982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6561187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:14.6561193Z 2025-05-07T20:32:14.6561404Z self = 2025-05-07T20:32:14.6562256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6562802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7db55700>} 2025-05-07T20:32:14.6563623Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6563823Z context = 2025-05-07T20:32:14.6563831Z 2025-05-07T20:32:14.6564001Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6564283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6564386Z module_map=module_map) 2025-05-07T20:32:14.6564554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6564657Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6564738Z E ^ 2025-05-07T20:32:14.6565124Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6565129Z 2025-05-07T20:32:14.6565574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6565656Z 2025-05-07T20:32:14.6565762Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6565999Z self=, 2025-05-07T20:32:14.6566077Z T=1, 2025-05-07T20:32:14.6566157Z D=5120, 2025-05-07T20:32:14.6566236Z scale_ub=1200.0, 2025-05-07T20:32:14.6566316Z contiguous=True, 2025-05-07T20:32:14.6566399Z compiled=True, 2025-05-07T20:32:14.6566469Z ) 2025-05-07T20:32:14.6566692Z self = 2025-05-07T20:32:14.6566868Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.6566873Z 2025-05-07T20:32:14.6566948Z @given( 2025-05-07T20:32:14.6567068Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6567163Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6567278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6567396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6567511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6567585Z ) 2025-05-07T20:32:14.6567849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6567940Z def test_silu_mul_quant( 2025-05-07T20:32:14.6568013Z self, 2025-05-07T20:32:14.6568092Z T: int, 2025-05-07T20:32:14.6568167Z D: int, 2025-05-07T20:32:14.6568263Z scale_ub: Optional[float], 2025-05-07T20:32:14.6568350Z contiguous: bool, 2025-05-07T20:32:14.6568433Z compiled: bool, 2025-05-07T20:32:14.6568507Z ) -> None: 2025-05-07T20:32:14.6568603Z torch.manual_seed(2025) 2025-05-07T20:32:14.6568675Z 2025-05-07T20:32:14.6568847Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6568923Z 2025-05-07T20:32:14.6569015Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6569150Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6569236Z x = x_sign * x_clamp 2025-05-07T20:32:14.6569314Z x0 = x[:, :D] 2025-05-07T20:32:14.6569399Z x1 = x[:, D:] 2025-05-07T20:32:14.6569577Z 2025-05-07T20:32:14.6569660Z if contiguous: 2025-05-07T20:32:14.6569755Z x0 = x0.contiguous() 2025-05-07T20:32:14.6569842Z x1 = x1.contiguous() 2025-05-07T20:32:14.6569914Z 2025-05-07T20:32:14.6570004Z if scale_ub is not None: 2025-05-07T20:32:14.6570106Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:14.6570243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6570319Z ) 2025-05-07T20:32:14.6570392Z else: 2025-05-07T20:32:14.6570487Z scale_ub_tensor = None 2025-05-07T20:32:14.6570559Z 2025-05-07T20:32:14.6570685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6570779Z op = silu_mul_quant 2025-05-07T20:32:14.6570867Z if compiled: 2025-05-07T20:32:14.6570965Z op = torch.compile(op) 2025-05-07T20:32:14.6571075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6571154Z 2025-05-07T20:32:14.6571243Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6571248Z 2025-05-07T20:32:14.6571346Z moe/activation_test.py:117: 2025-05-07T20:32:14.6571476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6571578Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6571674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6572067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6572160Z return fn(*args, **kwargs) 2025-05-07T20:32:14.6572694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6572872Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6573254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6573492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6573855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6573947Z kernel = self.compile( 2025-05-07T20:32:14.6574356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6574535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6574662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6574667Z 2025-05-07T20:32:14.6574874Z self = 2025-05-07T20:32:14.6575727Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6576281Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7dabf5e0>} 2025-05-07T20:32:14.6577093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6577290Z context = 2025-05-07T20:32:14.6577295Z 2025-05-07T20:32:14.6577465Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6577740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6577849Z module_map=module_map) 2025-05-07T20:32:14.6578011Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6578106Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6578256Z E ^ 2025-05-07T20:32:14.6578643Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6578648Z 2025-05-07T20:32:14.6579091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6579096Z 2025-05-07T20:32:14.6579199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6579428Z self=, 2025-05-07T20:32:14.6579501Z T=1, 2025-05-07T20:32:14.6579574Z D=5120, 2025-05-07T20:32:14.6579655Z scale_ub=None, 2025-05-07T20:32:14.6579740Z contiguous=False, 2025-05-07T20:32:14.6579824Z compiled=True, 2025-05-07T20:32:14.6579899Z ) 2025-05-07T20:32:14.6580136Z self = 2025-05-07T20:32:14.6580305Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.6580313Z 2025-05-07T20:32:14.6580388Z @given( 2025-05-07T20:32:14.6580514Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6580610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6580723Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6580841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6580955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6581027Z ) 2025-05-07T20:32:14.6581289Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6581380Z def test_silu_mul_quant( 2025-05-07T20:32:14.6581462Z self, 2025-05-07T20:32:14.6581533Z T: int, 2025-05-07T20:32:14.6581690Z D: int, 2025-05-07T20:32:14.6581790Z scale_ub: Optional[float], 2025-05-07T20:32:14.6581876Z contiguous: bool, 2025-05-07T20:32:14.6581958Z compiled: bool, 2025-05-07T20:32:14.6582037Z ) -> None: 2025-05-07T20:32:14.6582132Z torch.manual_seed(2025) 2025-05-07T20:32:14.6582203Z 2025-05-07T20:32:14.6582377Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6582447Z 2025-05-07T20:32:14.6582535Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6582659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6582945Z x = x_sign * x_clamp 2025-05-07T20:32:14.6583079Z x0 = x[:, :D] 2025-05-07T20:32:14.6583185Z x1 = x[:, D:] 2025-05-07T20:32:14.6583258Z 2025-05-07T20:32:14.6583347Z if contiguous: 2025-05-07T20:32:14.6583434Z x0 = x0.contiguous() 2025-05-07T20:32:14.6583521Z x1 = x1.contiguous() 2025-05-07T20:32:14.6583597Z 2025-05-07T20:32:14.6583691Z if scale_ub is not None: 2025-05-07T20:32:14.6583793Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6583931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6584007Z ) 2025-05-07T20:32:14.6584085Z else: 2025-05-07T20:32:14.6584180Z scale_ub_tensor = None 2025-05-07T20:32:14.6584250Z 2025-05-07T20:32:14.6584382Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6584469Z op = silu_mul_quant 2025-05-07T20:32:14.6584549Z if compiled: 2025-05-07T20:32:14.6584646Z op = torch.compile(op) 2025-05-07T20:32:14.6584750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6584821Z 2025-05-07T20:32:14.6584914Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6585034Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6585104Z 2025-05-07T20:32:14.6585244Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6585350Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6585450Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6585730Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6585877Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6585953Z 2025-05-07T20:32:14.6586052Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:14.6586056Z 2025-05-07T20:32:14.6586152Z moe/activation_test.py:126: 2025-05-07T20:32:14.6586283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6586385Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6586518Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6587128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6587224Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6587615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6587848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6588238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6588511Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6588936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.6589201Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6589600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6589822Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6590309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6590385Z fn() 2025-05-07T20:32:14.6590816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6590902Z self.fn.run( 2025-05-07T20:32:14.6591258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6591352Z kernel = self.compile( 2025-05-07T20:32:14.6591755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6591932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6592063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6592067Z 2025-05-07T20:32:14.6592277Z self = 2025-05-07T20:32:14.6593141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6593689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1d7cdb9af0>} 2025-05-07T20:32:14.6594496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6594693Z context = 2025-05-07T20:32:14.6594698Z 2025-05-07T20:32:14.6594863Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6595141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6595250Z module_map=module_map) 2025-05-07T20:32:14.6595491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6595596Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6595672Z E ^ 2025-05-07T20:32:14.6596053Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6596060Z 2025-05-07T20:32:14.6596502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6596507Z 2025-05-07T20:32:14.6596606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6596837Z self=, 2025-05-07T20:32:14.6596908Z T=1, 2025-05-07T20:32:14.6596979Z D=5120, 2025-05-07T20:32:14.6597059Z scale_ub=None, 2025-05-07T20:32:14.6597146Z contiguous=True, 2025-05-07T20:32:14.6597228Z compiled=False, 2025-05-07T20:32:14.6597307Z ) 2025-05-07T20:32:14.6597534Z self = 2025-05-07T20:32:14.6597701Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.6597706Z 2025-05-07T20:32:14.6597780Z @given( 2025-05-07T20:32:14.6597896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6597997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6598111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6598225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6598341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6598414Z ) 2025-05-07T20:32:14.6598670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6598763Z def test_silu_mul_quant( 2025-05-07T20:32:14.6598922Z self, 2025-05-07T20:32:14.6599000Z T: int, 2025-05-07T20:32:14.6599072Z D: int, 2025-05-07T20:32:14.6599170Z scale_ub: Optional[float], 2025-05-07T20:32:14.6599261Z contiguous: bool, 2025-05-07T20:32:14.6599346Z compiled: bool, 2025-05-07T20:32:14.6599419Z ) -> None: 2025-05-07T20:32:14.6599517Z torch.manual_seed(2025) 2025-05-07T20:32:14.6599591Z 2025-05-07T20:32:14.6599760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6599840Z 2025-05-07T20:32:14.6599930Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6600051Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6600142Z x = x_sign * x_clamp 2025-05-07T20:32:14.6600219Z x0 = x[:, :D] 2025-05-07T20:32:14.6600298Z x1 = x[:, D:] 2025-05-07T20:32:14.6600368Z 2025-05-07T20:32:14.6600451Z if contiguous: 2025-05-07T20:32:14.6600545Z x0 = x0.contiguous() 2025-05-07T20:32:14.6600640Z x1 = x1.contiguous() 2025-05-07T20:32:14.6600712Z 2025-05-07T20:32:14.6600803Z if scale_ub is not None: 2025-05-07T20:32:14.6600905Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6601042Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6601127Z ) 2025-05-07T20:32:14.6601202Z else: 2025-05-07T20:32:14.6601293Z scale_ub_tensor = None 2025-05-07T20:32:14.6601373Z 2025-05-07T20:32:14.6601501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6601591Z op = silu_mul_quant 2025-05-07T20:32:14.6601672Z if compiled: 2025-05-07T20:32:14.6601771Z op 
= torch.compile(op) 2025-05-07T20:32:14.6601876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6601945Z 2025-05-07T20:32:14.6602031Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6602035Z 2025-05-07T20:32:14.6602136Z moe/activation_test.py:117: 2025-05-07T20:32:14.6602271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6602367Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6602573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6603113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6603213Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6603596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6603827Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6604198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6604290Z kernel = self.compile( 2025-05-07T20:32:14.6604697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6604882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6605014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6605018Z 2025-05-07T20:32:14.6605231Z self = 2025-05-07T20:32:14.6606074Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6606624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7d2f18b0>} 2025-05-07T20:32:14.6607434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6607705Z context = 2025-05-07T20:32:14.6607710Z 2025-05-07T20:32:14.6607889Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6608163Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6608274Z module_map=module_map) 2025-05-07T20:32:14.6608434Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6608530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6608603Z E ^ 2025-05-07T20:32:14.6608981Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6608986Z 2025-05-07T20:32:14.6609428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6609445Z 2025-05-07T20:32:14.6609544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6609772Z self=, 2025-05-07T20:32:14.6609855Z T=128, 2025-05-07T20:32:14.6609933Z D=5120, 2025-05-07T20:32:14.6610010Z scale_ub=None, 2025-05-07T20:32:14.6610094Z contiguous=False, 2025-05-07T20:32:14.6610173Z compiled=True, 2025-05-07T20:32:14.6610246Z ) 2025-05-07T20:32:14.6610472Z self = 2025-05-07T20:32:14.6610644Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.6610649Z 2025-05-07T20:32:14.6610723Z @given( 2025-05-07T20:32:14.6610841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6610939Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6611062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6611177Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6611294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6611367Z ) 2025-05-07T20:32:14.6611706Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6611801Z def test_silu_mul_quant( 2025-05-07T20:32:14.6611883Z self, 2025-05-07T20:32:14.6611957Z T: int, 2025-05-07T20:32:14.6612030Z D: int, 2025-05-07T20:32:14.6612128Z scale_ub: Optional[float], 2025-05-07T20:32:14.6612212Z contiguous: bool, 2025-05-07T20:32:14.6612300Z compiled: bool, 2025-05-07T20:32:14.6612374Z ) -> None: 2025-05-07T20:32:14.6612463Z torch.manual_seed(2025) 2025-05-07T20:32:14.6612539Z 2025-05-07T20:32:14.6612709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6612782Z 2025-05-07T20:32:14.6612878Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6613003Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6613092Z x = x_sign * x_clamp 2025-05-07T20:32:14.6613177Z x0 = x[:, :D] 2025-05-07T20:32:14.6613256Z x1 = x[:, D:] 2025-05-07T20:32:14.6613328Z 2025-05-07T20:32:14.6613425Z if contiguous: 2025-05-07T20:32:14.6613514Z x0 = x0.contiguous() 2025-05-07T20:32:14.6613601Z x1 = x1.contiguous() 2025-05-07T20:32:14.6613677Z 2025-05-07T20:32:14.6613764Z if scale_ub is not None: 2025-05-07T20:32:14.6613869Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6614005Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6614077Z ) 2025-05-07T20:32:14.6614154Z else: 2025-05-07T20:32:14.6614247Z scale_ub_tensor = None 2025-05-07T20:32:14.6614327Z 2025-05-07T20:32:14.6614457Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6614554Z op = silu_mul_quant 2025-05-07T20:32:14.6614634Z if compiled: 2025-05-07T20:32:14.6614819Z op = torch.compile(op) 2025-05-07T20:32:14.6614925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6614997Z 2025-05-07T20:32:14.6615093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6615098Z 2025-05-07T20:32:14.6615192Z moe/activation_test.py:117: 2025-05-07T20:32:14.6615323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6615425Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6615522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6615911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6616006Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6616539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6616634Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6617018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6617254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6617619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6617708Z kernel = self.compile( 2025-05-07T20:32:14.6618115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6618297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6618425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6618429Z 2025-05-07T20:32:14.6618640Z self = 2025-05-07T20:32:14.6619484Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6620124Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7cd1ee50>} 2025-05-07T20:32:14.6620936Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6621134Z context = 2025-05-07T20:32:14.6621138Z 2025-05-07T20:32:14.6621309Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6621579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6621691Z module_map=module_map) 2025-05-07T20:32:14.6621855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6621952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6622030Z E ^ 2025-05-07T20:32:14.6622416Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6622421Z 2025-05-07T20:32:14.6622863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6622867Z 2025-05-07T20:32:14.6622970Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6623198Z self=, 2025-05-07T20:32:14.6623276Z T=128, 2025-05-07T20:32:14.6623351Z D=7168, 2025-05-07T20:32:14.6623429Z scale_ub=1200.0, 2025-05-07T20:32:14.6623517Z contiguous=False, 2025-05-07T20:32:14.6623598Z compiled=False, 2025-05-07T20:32:14.6623668Z ) 2025-05-07T20:32:14.6623976Z self = 2025-05-07T20:32:14.6624151Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.6624156Z 2025-05-07T20:32:14.6624235Z @given( 2025-05-07T20:32:14.6624352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6624448Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6624570Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6624683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6624794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6624870Z ) 2025-05-07T20:32:14.6625125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6625216Z def test_silu_mul_quant( 2025-05-07T20:32:14.6625296Z self, 2025-05-07T20:32:14.6625371Z T: int, 2025-05-07T20:32:14.6625444Z D: int, 2025-05-07T20:32:14.6625550Z scale_ub: Optional[float], 2025-05-07T20:32:14.6625639Z contiguous: bool, 2025-05-07T20:32:14.6625721Z compiled: bool, 2025-05-07T20:32:14.6625800Z ) -> None: 2025-05-07T20:32:14.6625894Z torch.manual_seed(2025) 2025-05-07T20:32:14.6625967Z 2025-05-07T20:32:14.6626138Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6626209Z 2025-05-07T20:32:14.6626304Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6626424Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6626511Z x = x_sign * x_clamp 2025-05-07T20:32:14.6626594Z x0 = x[:, :D] 2025-05-07T20:32:14.6626674Z x1 = x[:, D:] 2025-05-07T20:32:14.6626744Z 2025-05-07T20:32:14.6626830Z if contiguous: 2025-05-07T20:32:14.6626918Z x0 = x0.contiguous() 2025-05-07T20:32:14.6627006Z x1 = x1.contiguous() 2025-05-07T20:32:14.6627083Z 2025-05-07T20:32:14.6627172Z if scale_ub is not None: 2025-05-07T20:32:14.6627281Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6627414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6627485Z ) 2025-05-07T20:32:14.6627649Z else: 2025-05-07T20:32:14.6627744Z scale_ub_tensor = None 2025-05-07T20:32:14.6627814Z 2025-05-07T20:32:14.6627943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6628031Z op = silu_mul_quant 2025-05-07T20:32:14.6628111Z if compiled: 2025-05-07T20:32:14.6628211Z op = torch.compile(op) 2025-05-07T20:32:14.6628315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6628387Z 2025-05-07T20:32:14.6628478Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6628482Z 2025-05-07T20:32:14.6628576Z moe/activation_test.py:117: 2025-05-07T20:32:14.6628708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6628806Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6628907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6629453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6629549Z 
(The next four examples reprint the same test source and fail with the identical CompilationError in _fbgemm_silu_mul_quant; only the parameters differ. For compiled=True runs the traceback additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching silu_mul_quant.)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

(Same test source as above, but this example gets past fn(); the failure is raised one step later, in the reference path.)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f1d7c188430>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
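For context on what the reference path computes: triton_quantize_fp8_row returns, per row, an FP8 payload plus a float32 scale such that y is approximately y_fp8.float() * y_scale[:, None], which is how the test dequantizes above. A rough eager-mode sketch of that rowwise quantization, assuming float8_e4m3fn with finite max 448.0 (the constant, the clamping, and the helper name are assumptions, not FBGEMM's exact kernel):

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn (assumed target dtype)

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, floored to avoid division by zero.
        row_amax = y.abs().amax(dim=1).float().clamp(min=1e-12)
        # Optional upper bound on the row scale (the test's scale_ub_tensor).
        if scale_ub is not None:
            row_amax = torch.minimum(row_amax, scale_ub)
        scale = row_amax / FP8_E4M3_MAX                  # dequantization scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Note that a plain eager cast to torch.float8_e4m3fn works even on sm_86; only Triton code generation for the dtype is rejected, which is why the error surfaces inside the kernels rather than during tensor setup.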
at 0x7f1d7c188430>} 2025-05-07T20:32:14.6709194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6709390Z context = 2025-05-07T20:32:14.6709394Z 2025-05-07T20:32:14.6709568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6709921Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6710037Z module_map=module_map) 2025-05-07T20:32:14.6710204Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6710307Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6710382Z E ^ 2025-05-07T20:32:14.6710756Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6710761Z 2025-05-07T20:32:14.6711205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6711213Z 2025-05-07T20:32:14.6711315Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6711544Z self=, 2025-05-07T20:32:14.6711624Z T=1, 2025-05-07T20:32:14.6711700Z D=5120, 2025-05-07T20:32:14.6711782Z scale_ub=1200.0, 2025-05-07T20:32:14.6711952Z contiguous=False, 2025-05-07T20:32:14.6712033Z compiled=True, 2025-05-07T20:32:14.6712104Z ) 2025-05-07T20:32:14.6712330Z self = 2025-05-07T20:32:14.6712503Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.6712507Z 2025-05-07T20:32:14.6712583Z @given( 2025-05-07T20:32:14.6712699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6712799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6712918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6713032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6713142Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6713220Z ) 2025-05-07T20:32:14.6713476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6713565Z def test_silu_mul_quant( 2025-05-07T20:32:14.6713653Z self, 2025-05-07T20:32:14.6713728Z T: int, 2025-05-07T20:32:14.6713802Z D: int, 2025-05-07T20:32:14.6713902Z scale_ub: Optional[float], 2025-05-07T20:32:14.6713989Z contiguous: bool, 2025-05-07T20:32:14.6714081Z compiled: bool, 2025-05-07T20:32:14.6714159Z ) -> None: 2025-05-07T20:32:14.6714252Z torch.manual_seed(2025) 2025-05-07T20:32:14.6714327Z 2025-05-07T20:32:14.6714497Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6714568Z 2025-05-07T20:32:14.6714658Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6714780Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6714870Z x = x_sign * x_clamp 2025-05-07T20:32:14.6714957Z x0 = x[:, :D] 2025-05-07T20:32:14.6715036Z x1 = x[:, D:] 2025-05-07T20:32:14.6715109Z 2025-05-07T20:32:14.6715196Z if contiguous: 2025-05-07T20:32:14.6715285Z x0 = x0.contiguous() 2025-05-07T20:32:14.6715382Z x1 = x1.contiguous() 2025-05-07T20:32:14.6715454Z 2025-05-07T20:32:14.6715540Z if scale_ub is not None: 2025-05-07T20:32:14.6715647Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6715864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6715939Z ) 2025-05-07T20:32:14.6716017Z else: 2025-05-07T20:32:14.6716110Z scale_ub_tensor = None 2025-05-07T20:32:14.6716182Z 2025-05-07T20:32:14.6716313Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6716401Z op = silu_mul_quant 2025-05-07T20:32:14.6716485Z if compiled: 
2025-05-07T20:32:14.6716591Z op = torch.compile(op) 2025-05-07T20:32:14.6716694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6716770Z 2025-05-07T20:32:14.6716860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6716865Z 2025-05-07T20:32:14.6716959Z moe/activation_test.py:117: 2025-05-07T20:32:14.6717093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6717190Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6717287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6717687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6717778Z return fn(*args, **kwargs) 2025-05-07T20:32:14.6718314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6718413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6718794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6719027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6719391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6719564Z kernel = self.compile( 2025-05-07T20:32:14.6719975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6720158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6720293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6720297Z 2025-05-07T20:32:14.6720508Z self = 2025-05-07T20:32:14.6721355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6721902Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7c188e50>} 2025-05-07T20:32:14.6722718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6722917Z context = 2025-05-07T20:32:14.6722922Z 2025-05-07T20:32:14.6723089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6723365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6723473Z module_map=module_map) 2025-05-07T20:32:14.6723635Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6723735Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6723810Z E ^ 2025-05-07T20:32:14.6724191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6724201Z 2025-05-07T20:32:14.6724650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6724654Z 2025-05-07T20:32:14.6724835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6725070Z self=, 2025-05-07T20:32:14.6725145Z T=1, 2025-05-07T20:32:14.6725220Z D=5120, 2025-05-07T20:32:14.6725309Z scale_ub=1200.0, 2025-05-07T20:32:14.6725395Z contiguous=False, 2025-05-07T20:32:14.6725475Z compiled=False, 2025-05-07T20:32:14.6725547Z ) 2025-05-07T20:32:14.6725770Z self = 2025-05-07T20:32:14.6725942Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.6725946Z 2025-05-07T20:32:14.6726024Z @given( 2025-05-07T20:32:14.6726141Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6726248Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6726364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6726477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6726598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6726670Z ) 2025-05-07T20:32:14.6726926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6727019Z def test_silu_mul_quant( 2025-05-07T20:32:14.6727093Z self, 2025-05-07T20:32:14.6727169Z T: int, 2025-05-07T20:32:14.6727250Z D: int, 2025-05-07T20:32:14.6727346Z scale_ub: Optional[float], 2025-05-07T20:32:14.6727430Z contiguous: bool, 2025-05-07T20:32:14.6727518Z compiled: bool, 2025-05-07T20:32:14.6727594Z ) -> None: 2025-05-07T20:32:14.6727689Z torch.manual_seed(2025) 2025-05-07T20:32:14.6727759Z 2025-05-07T20:32:14.6727927Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6728087Z 2025-05-07T20:32:14.6728177Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6728301Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6728397Z x = x_sign * x_clamp 2025-05-07T20:32:14.6728479Z x0 = x[:, :D] 2025-05-07T20:32:14.6728560Z x1 = x[:, D:] 2025-05-07T20:32:14.6728638Z 2025-05-07T20:32:14.6728718Z if contiguous: 2025-05-07T20:32:14.6728806Z x0 = x0.contiguous() 2025-05-07T20:32:14.6728897Z x1 = x1.contiguous() 2025-05-07T20:32:14.6728972Z 2025-05-07T20:32:14.6729060Z if scale_ub is not None: 2025-05-07T20:32:14.6729168Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6729307Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6729387Z ) 2025-05-07T20:32:14.6729464Z else: 2025-05-07T20:32:14.6729556Z scale_ub_tensor = None 2025-05-07T20:32:14.6729635Z 2025-05-07T20:32:14.6729762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6729850Z op = silu_mul_quant 2025-05-07T20:32:14.6729935Z if compiled: 2025-05-07T20:32:14.6730035Z op = torch.compile(op) 2025-05-07T20:32:14.6730138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6730213Z 2025-05-07T20:32:14.6730303Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6730308Z 2025-05-07T20:32:14.6730407Z moe/activation_test.py:117: 2025-05-07T20:32:14.6730535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6730634Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6730735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6731274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6731368Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6731758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6731990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6732436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6732529Z kernel = self.compile( 2025-05-07T20:32:14.6732935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6733112Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6733238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6733242Z 2025-05-07T20:32:14.6733453Z self = 2025-05-07T20:32:14.6734303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6734859Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cefb79820>} 2025-05-07T20:32:14.6735672Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6735865Z context = 2025-05-07T20:32:14.6735870Z 2025-05-07T20:32:14.6736041Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6736313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6736418Z module_map=module_map) 2025-05-07T20:32:14.6736683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6736780Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6736857Z E ^ 2025-05-07T20:32:14.6737243Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6737248Z 2025-05-07T20:32:14.6737692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6737696Z 2025-05-07T20:32:14.6737800Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6738029Z self=, 2025-05-07T20:32:14.6738105Z T=16384, 2025-05-07T20:32:14.6738183Z D=5120, 2025-05-07T20:32:14.6738262Z scale_ub=1200.0, 2025-05-07T20:32:14.6738348Z contiguous=False, 2025-05-07T20:32:14.6738432Z compiled=True, 2025-05-07T20:32:14.6738505Z ) 2025-05-07T20:32:14.6738737Z self = 2025-05-07T20:32:14.6738919Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.6738923Z 2025-05-07T20:32:14.6739005Z @given( 2025-05-07T20:32:14.6739127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6739224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6739338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6739455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6739566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6739644Z ) 2025-05-07T20:32:14.6739901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6739991Z def test_silu_mul_quant( 2025-05-07T20:32:14.6740068Z self, 2025-05-07T20:32:14.6740141Z T: int, 2025-05-07T20:32:14.6740216Z D: int, 2025-05-07T20:32:14.6740333Z scale_ub: Optional[float], 2025-05-07T20:32:14.6740431Z contiguous: bool, 2025-05-07T20:32:14.6740533Z compiled: bool, 2025-05-07T20:32:14.6740621Z ) -> None: 2025-05-07T20:32:14.6740793Z torch.manual_seed(2025) 2025-05-07T20:32:14.6740870Z 2025-05-07T20:32:14.6741047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6741120Z 2025-05-07T20:32:14.6741209Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6741334Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6741424Z x = x_sign * x_clamp 2025-05-07T20:32:14.6741507Z x0 = x[:, :D] 2025-05-07T20:32:14.6741587Z x1 = x[:, D:] 2025-05-07T20:32:14.6741659Z 2025-05-07T20:32:14.6741740Z if contiguous: 2025-05-07T20:32:14.6741831Z x0 = x0.contiguous() 2025-05-07T20:32:14.6741916Z x1 = x1.contiguous() 2025-05-07T20:32:14.6741989Z 2025-05-07T20:32:14.6742079Z if scale_ub is not None: 2025-05-07T20:32:14.6742191Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6742327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6742403Z ) 2025-05-07T20:32:14.6742491Z else: 2025-05-07T20:32:14.6742589Z scale_ub_tensor = None 2025-05-07T20:32:14.6742658Z 2025-05-07T20:32:14.6742785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6742877Z op = silu_mul_quant 2025-05-07T20:32:14.6742958Z if compiled: 2025-05-07T20:32:14.6743057Z op = torch.compile(op) 2025-05-07T20:32:14.6743161Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6743234Z 2025-05-07T20:32:14.6743322Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6743326Z 2025-05-07T20:32:14.6743427Z moe/activation_test.py:117: 2025-05-07T20:32:14.6743559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6743658Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6743837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6744228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6744325Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6744862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6744958Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6745343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6745574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6745939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6746030Z kernel = self.compile( 2025-05-07T20:32:14.6746436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6746624Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6746756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6746761Z 2025-05-07T20:32:14.6746970Z self = 2025-05-07T20:32:14.6747820Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6748367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1d7c137790>} 2025-05-07T20:32:14.6749183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6749381Z context = 2025-05-07T20:32:14.6749465Z 2025-05-07T20:32:14.6749636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6750022Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6750128Z module_map=module_map) 2025-05-07T20:32:14.6750293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6750399Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6750491Z E ^ 2025-05-07T20:32:14.6750898Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6750903Z 2025-05-07T20:32:14.6751346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6751355Z 2025-05-07T20:32:14.6751459Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6751694Z self=, 2025-05-07T20:32:14.6751771Z T=2048, 2025-05-07T20:32:14.6751849Z D=7168, 2025-05-07T20:32:14.6751929Z scale_ub=1200.0, 2025-05-07T20:32:14.6752013Z contiguous=False, 2025-05-07T20:32:14.6752099Z compiled=True, 2025-05-07T20:32:14.6752168Z ) 2025-05-07T20:32:14.6752393Z self = 2025-05-07T20:32:14.6752572Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.6752577Z 2025-05-07T20:32:14.6752648Z @given( 2025-05-07T20:32:14.6752771Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6752865Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6752978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6753184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6753295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6753374Z ) 2025-05-07T20:32:14.6753634Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6753726Z def test_silu_mul_quant( 2025-05-07T20:32:14.6753805Z self, 2025-05-07T20:32:14.6753879Z T: int, 2025-05-07T20:32:14.6753955Z D: int, 2025-05-07T20:32:14.6754054Z scale_ub: Optional[float], 2025-05-07T20:32:14.6754141Z contiguous: bool, 2025-05-07T20:32:14.6754224Z compiled: bool, 2025-05-07T20:32:14.6754303Z ) -> None: 2025-05-07T20:32:14.6754395Z torch.manual_seed(2025) 2025-05-07T20:32:14.6754469Z 2025-05-07T20:32:14.6754643Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6754715Z 2025-05-07T20:32:14.6754803Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6754933Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6755018Z x = x_sign * x_clamp 2025-05-07T20:32:14.6755100Z x0 = x[:, :D] 2025-05-07T20:32:14.6755177Z x1 = x[:, D:] 2025-05-07T20:32:14.6755252Z 2025-05-07T20:32:14.6755337Z if contiguous: 2025-05-07T20:32:14.6755426Z x0 = x0.contiguous() 2025-05-07T20:32:14.6755514Z x1 = x1.contiguous() 2025-05-07T20:32:14.6755587Z 2025-05-07T20:32:14.6755675Z if scale_ub is not None: 2025-05-07T20:32:14.6755777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6755916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6755993Z ) 2025-05-07T20:32:14.6756068Z else: 2025-05-07T20:32:14.6756162Z scale_ub_tensor = None 2025-05-07T20:32:14.6756234Z 2025-05-07T20:32:14.6756365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6756451Z op = silu_mul_quant 2025-05-07T20:32:14.6756539Z if compiled: 2025-05-07T20:32:14.6756642Z op = torch.compile(op) 2025-05-07T20:32:14.6756745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6756899Z 2025-05-07T20:32:14.6756992Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6756997Z 2025-05-07T20:32:14.6757092Z moe/activation_test.py:117: 2025-05-07T20:32:14.6757222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6757325Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6757424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6757819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6757911Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6758448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6758554Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6758936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6759171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6759535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6759628Z kernel = self.compile( 2025-05-07T20:32:14.6760039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6760216Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6760344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6760349Z 2025-05-07T20:32:14.6760561Z self = 2025-05-07T20:32:14.6761409Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6762122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cefadf4c0>} 2025-05-07T20:32:14.6762934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6763129Z context = 2025-05-07T20:32:14.6763138Z 2025-05-07T20:32:14.6763304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6763577Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6763690Z module_map=module_map) 2025-05-07T20:32:14.6763852Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6763951Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6764037Z E ^ 2025-05-07T20:32:14.6764417Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6764802Z 2025-05-07T20:32:14.6765252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6765256Z 2025-05-07T20:32:14.6765357Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6765586Z self=, 2025-05-07T20:32:14.6765671Z T=1, 2025-05-07T20:32:14.6765750Z D=5120, 2025-05-07T20:32:14.6765833Z scale_ub=None, 2025-05-07T20:32:14.6765928Z contiguous=False, 2025-05-07T20:32:14.6766015Z compiled=False, 2025-05-07T20:32:14.6766098Z ) 2025-05-07T20:32:14.6766322Z self = 2025-05-07T20:32:14.6766596Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.6766601Z 2025-05-07T20:32:14.6766681Z @given( 2025-05-07T20:32:14.6766798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6766892Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6767009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6767124Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6767233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6767311Z ) 2025-05-07T20:32:14.6767567Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6767665Z def test_silu_mul_quant( 2025-05-07T20:32:14.6767743Z self, 2025-05-07T20:32:14.6767819Z T: int, 2025-05-07T20:32:14.6767908Z D: int, 2025-05-07T20:32:14.6768004Z scale_ub: Optional[float], 2025-05-07T20:32:14.6768091Z contiguous: bool, 2025-05-07T20:32:14.6768181Z compiled: bool, 2025-05-07T20:32:14.6768259Z ) -> None: 2025-05-07T20:32:14.6768355Z torch.manual_seed(2025) 2025-05-07T20:32:14.6768430Z 2025-05-07T20:32:14.6768600Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6768671Z 2025-05-07T20:32:14.6768765Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6768886Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6768976Z x = x_sign * x_clamp 2025-05-07T20:32:14.6769054Z x0 = x[:, :D] 2025-05-07T20:32:14.6769129Z x1 = x[:, D:] 2025-05-07T20:32:14.6769208Z 2025-05-07T20:32:14.6769287Z if contiguous: 2025-05-07T20:32:14.6769374Z x0 = x0.contiguous() 2025-05-07T20:32:14.6769466Z x1 = x1.contiguous() 2025-05-07T20:32:14.6769537Z 2025-05-07T20:32:14.6769712Z if scale_ub is not None: 2025-05-07T20:32:14.6769818Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6769952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6770033Z ) 2025-05-07T20:32:14.6770109Z else: 2025-05-07T20:32:14.6770210Z scale_ub_tensor = None 2025-05-07T20:32:14.6770293Z 2025-05-07T20:32:14.6770446Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6770536Z op = silu_mul_quant 2025-05-07T20:32:14.6770622Z if compiled: 2025-05-07T20:32:14.6770720Z op = torch.compile(op) 2025-05-07T20:32:14.6770824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6770899Z 2025-05-07T20:32:14.6770987Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6770992Z 2025-05-07T20:32:14.6771085Z moe/activation_test.py:117: 2025-05-07T20:32:14.6771225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6771327Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6771425Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6771970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6772069Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6772454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6772688Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6773051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6773144Z kernel = self.compile( 2025-05-07T20:32:14.6773553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6773738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6773874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6773879Z 2025-05-07T20:32:14.6774169Z self = 2025-05-07T20:32:14.6775016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6775559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cefadf820>} 2025-05-07T20:32:14.6776375Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6776575Z context = 2025-05-07T20:32:14.6776579Z 2025-05-07T20:32:14.6776749Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6777032Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6777139Z module_map=module_map) 2025-05-07T20:32:14.6777303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6777402Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6777480Z E ^ 2025-05-07T20:32:14.6777860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
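Every draw fails at the same place: Triton's make_ir rejects the fp8e4nv (E4M3) element type while building IR for _fbgemm_silu_mul_quant. fp8e4nv is only lowered on NVIDIA parts with native FP8 support (compute capability 8.9 and newer, i.e. Ada/Hopper); older architectures such as sm_80/sm_86 expose only fp8e4b15 and fp8e5, which is exactly what the ValueError enumerates. Below is a minimal sketch of the kind of capability guard that would skip these draws on unsupported hardware; supports_fp8e4nv is a hypothetical helper, not part of activation_test.py:

    import torch


    def supports_fp8e4nv() -> bool:
        """True if the current CUDA device can lower Triton's fp8e4nv (E4M3) type."""
        if not torch.cuda.is_available():
            return False
        # Native E4M3 needs SM 8.9+ (Ada/Hopper). Earlier parts (sm_80, sm_86, ...)
        # only get fp8e4b15/fp8e5, the two dtypes the ValueError above lists.
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical usage on the failing test:
    #   @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    #   def test_silu_mul_quant(self, ...): ...
    if __name__ == "__main__":
        print("fp8e4nv supported:", supports_fp8e4nv())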
Hypothesis went on to draw the remaining parameter combinations, and every one failed with this same CompilationError; the traceback is identical each time, running from moe/activation_test.py:117 through silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 into triton/compiler/compiler.py:100 (plus torch/_dynamo/eval_frame.py:678 when compiled=True):

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
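For reference, the failure does not depend on the FBGEMM kernel itself: any @triton.jit kernel that materializes tl.float8e4nv compiles into the same error on such a device. A hypothetical, self-contained reproducer (kernel and variable names are illustrative, not from the FBGEMM sources):

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _fp8e4nv_roundtrip(x_ptr, y_ptr, N: tl.constexpr):
        offs = tl.arange(0, N)
        x = tl.load(x_ptr + offs)
        # This cast is what src.make_ir rejects on pre-SM-8.9 GPUs, producing
        # ValueError("type fp8e4nv not supported in this architecture. ...").
        y = x.to(tl.float8e4nv).to(tl.float32)
        tl.store(y_ptr + offs, y)


    x = torch.randn(16, device="cuda", dtype=torch.float32)
    y = torch.empty_like(x)
    # On SM 8.9+ this runs; on older GPUs it raises
    # triton.compiler.errors.CompilationError wrapping the ValueError above.
    _fp8e4nv_roundtrip[(1,)](x, y, N=16)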
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.6917140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.6917255Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.6930662Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.6946971Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.6960610Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
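Every generated example fails identically: Triton's fp8e4nv type (its name for the NVIDIA float8 e4m3 format, torch.float8_e4m3fn in PyTorch) is only supported on GPUs with compute capability 8.9 or newer (Ada/Hopper), and this job's linux.g5.4xlarge runner carries an A10G, an SM 8.6 part on which Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal probe for this condition, sketched below with a hypothetical helper name (not part of the FBGEMM tests), checks the device capability before any fp8e4nv kernel is launched:

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (float8 e4m3) requires SM >= 8.9; the A10G on this
        # runner reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)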
2025-05-07T20:32:14.6974177Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.6988105Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.7001692Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.7015110Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError (fp8e4nv not supported)
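If the goal is for the suite to pass on SM 8.6 runners rather than exercise fp8e4nv there, a probe like the one above could gate the test through unittest's skip machinery; this is a sketch only, and the class name is a stand-in, since the real test lives in moe/activation_test.py:

    import unittest

    class ActivationTest(unittest.TestCase):  # hypothetical stand-in
        @unittest.skipIf(
            not fp8e4nv_supported(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)"
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown in the traceback above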
2025-05-07T20:32:14.7028631Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.7042272Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:14.7055241Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError (fp8e4nv not supported)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7071094Z 2025-05-07T20:32:14.7071537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7071542Z 2025-05-07T20:32:14.7071645Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7071875Z self=, 2025-05-07T20:32:14.7071957Z T=4096, 2025-05-07T20:32:14.7072035Z D=7168, 2025-05-07T20:32:14.7072116Z scale_ub=1200.0, 2025-05-07T20:32:14.7072202Z contiguous=False, 2025-05-07T20:32:14.7072285Z compiled=True, 2025-05-07T20:32:14.7072357Z ) 2025-05-07T20:32:14.7072587Z self = 2025-05-07T20:32:14.7072850Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.7072855Z 2025-05-07T20:32:14.7072927Z @given( 2025-05-07T20:32:14.7073054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7073152Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7073266Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7073384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7073496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7073574Z ) 2025-05-07T20:32:14.7073833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7073926Z def test_silu_mul_quant( 2025-05-07T20:32:14.7074005Z self, 2025-05-07T20:32:14.7074080Z T: int, 2025-05-07T20:32:14.7074155Z D: int, 2025-05-07T20:32:14.7074256Z scale_ub: Optional[float], 2025-05-07T20:32:14.7074344Z contiguous: bool, 2025-05-07T20:32:14.7074429Z compiled: bool, 2025-05-07T20:32:14.7074511Z ) -> None: 2025-05-07T20:32:14.7074605Z torch.manual_seed(2025) 2025-05-07T20:32:14.7074677Z 2025-05-07T20:32:14.7074854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7074928Z 2025-05-07T20:32:14.7075018Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7075148Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7075239Z x = x_sign * x_clamp 2025-05-07T20:32:14.7075319Z x0 = x[:, :D] 2025-05-07T20:32:14.7075395Z x1 = x[:, D:] 2025-05-07T20:32:14.7075468Z 2025-05-07T20:32:14.7075552Z if contiguous: 2025-05-07T20:32:14.7075641Z x0 = x0.contiguous() 2025-05-07T20:32:14.7075728Z x1 = x1.contiguous() 2025-05-07T20:32:14.7075803Z 2025-05-07T20:32:14.7075895Z if scale_ub is not None: 2025-05-07T20:32:14.7075998Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7076141Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7076216Z ) 2025-05-07T20:32:14.7076290Z else: 2025-05-07T20:32:14.7076385Z scale_ub_tensor = None 2025-05-07T20:32:14.7076545Z 2025-05-07T20:32:14.7076676Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7076768Z op = silu_mul_quant 2025-05-07T20:32:14.7076851Z if compiled: 2025-05-07T20:32:14.7076952Z op = torch.compile(op) 2025-05-07T20:32:14.7077057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7077127Z 2025-05-07T20:32:14.7077220Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7077225Z 2025-05-07T20:32:14.7077319Z moe/activation_test.py:117: 2025-05-07T20:32:14.7077450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7077552Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7077648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7078046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.7078137Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.7078675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7078774Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7079157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7079387Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7079752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7079843Z kernel = self.compile( 2025-05-07T20:32:14.7080252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7080513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7080642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7080647Z 2025-05-07T20:32:14.7080863Z self = 2025-05-07T20:32:14.7081710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7082259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef41eee0>} 2025-05-07T20:32:14.7083393Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7083603Z context = 2025-05-07T20:32:14.7083613Z 2025-05-07T20:32:14.7083788Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7084067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7084178Z module_map=module_map) 2025-05-07T20:32:14.7084342Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7084443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7084526Z E ^ 2025-05-07T20:32:14.7084908Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7084912Z 2025-05-07T20:32:14.7085364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7085369Z 2025-05-07T20:32:14.7085479Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7085712Z self=, 2025-05-07T20:32:14.7085798Z T=128, 2025-05-07T20:32:14.7086026Z D=7168, 2025-05-07T20:32:14.7086110Z scale_ub=1200.0, 2025-05-07T20:32:14.7086197Z contiguous=False, 2025-05-07T20:32:14.7086278Z compiled=True, 2025-05-07T20:32:14.7086349Z ) 2025-05-07T20:32:14.7086576Z self = 2025-05-07T20:32:14.7086752Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.7086757Z 2025-05-07T20:32:14.7086836Z @given( 2025-05-07T20:32:14.7086952Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7087049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7087166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7087279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7087392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7087470Z ) 2025-05-07T20:32:14.7087726Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7087823Z def test_silu_mul_quant( 2025-05-07T20:32:14.7087901Z self, 2025-05-07T20:32:14.7087979Z T: int, 2025-05-07T20:32:14.7088059Z D: int, 2025-05-07T20:32:14.7088156Z scale_ub: Optional[float], 2025-05-07T20:32:14.7088245Z contiguous: bool, 2025-05-07T20:32:14.7088335Z compiled: bool, 2025-05-07T20:32:14.7088414Z ) -> None: 2025-05-07T20:32:14.7088508Z torch.manual_seed(2025) 2025-05-07T20:32:14.7088584Z 2025-05-07T20:32:14.7088754Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7088826Z 2025-05-07T20:32:14.7088918Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7089044Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7089280Z x = x_sign * x_clamp 2025-05-07T20:32:14.7089362Z x0 = x[:, :D] 2025-05-07T20:32:14.7089440Z x1 = x[:, D:] 2025-05-07T20:32:14.7089516Z 2025-05-07T20:32:14.7089595Z if contiguous: 2025-05-07T20:32:14.7089689Z x0 = x0.contiguous() 2025-05-07T20:32:14.7089778Z x1 = x1.contiguous() 2025-05-07T20:32:14.7089850Z 2025-05-07T20:32:14.7089941Z if scale_ub is not None: 2025-05-07T20:32:14.7090047Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7090181Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7090252Z ) 2025-05-07T20:32:14.7090331Z else: 2025-05-07T20:32:14.7090423Z scale_ub_tensor = None 2025-05-07T20:32:14.7090497Z 2025-05-07T20:32:14.7090627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7090718Z op = silu_mul_quant 2025-05-07T20:32:14.7090803Z if compiled: 2025-05-07T20:32:14.7090904Z op = torch.compile(op) 2025-05-07T20:32:14.7091013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7091088Z 2025-05-07T20:32:14.7091175Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7091179Z 2025-05-07T20:32:14.7091280Z moe/activation_test.py:117: 2025-05-07T20:32:14.7091413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7091511Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7091607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7092002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.7092093Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.7092634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7092731Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7093109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7093348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7093790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7093885Z kernel = self.compile( 2025-05-07T20:32:14.7094294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7094471Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7094602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7094606Z 2025-05-07T20:32:14.7094817Z self = 2025-05-07T20:32:14.7095663Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7096220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef3c9af0>} 2025-05-07T20:32:14.7097030Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7097229Z context = 2025-05-07T20:32:14.7097233Z 2025-05-07T20:32:14.7097400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7097675Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7097782Z module_map=module_map) 2025-05-07T20:32:14.7098023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7098127Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7098199Z E ^ 2025-05-07T20:32:14.7098581Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7098586Z 2025-05-07T20:32:14.7099033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7099037Z 2025-05-07T20:32:14.7099138Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7099370Z self=, 2025-05-07T20:32:14.7099447Z T=2048, 2025-05-07T20:32:14.7099520Z D=7168, 2025-05-07T20:32:14.7099602Z scale_ub=None, 2025-05-07T20:32:14.7099685Z contiguous=True, 2025-05-07T20:32:14.7099766Z compiled=True, 2025-05-07T20:32:14.7099840Z ) 2025-05-07T20:32:14.7100063Z self = 2025-05-07T20:32:14.7100242Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.7100249Z 2025-05-07T20:32:14.7100325Z @given( 2025-05-07T20:32:14.7100447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7100547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7100660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7100775Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7100890Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7100962Z ) 2025-05-07T20:32:14.7101217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7101314Z def test_silu_mul_quant( 2025-05-07T20:32:14.7101390Z self, 2025-05-07T20:32:14.7101468Z T: int, 2025-05-07T20:32:14.7101546Z D: int, 2025-05-07T20:32:14.7101642Z scale_ub: Optional[float], 2025-05-07T20:32:14.7101737Z contiguous: bool, 2025-05-07T20:32:14.7101820Z compiled: bool, 2025-05-07T20:32:14.7101896Z ) -> None: 2025-05-07T20:32:14.7101990Z torch.manual_seed(2025) 2025-05-07T20:32:14.7102062Z 2025-05-07T20:32:14.7102314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7102389Z 2025-05-07T20:32:14.7102477Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7102600Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7102693Z x = x_sign * x_clamp 2025-05-07T20:32:14.7102770Z x0 = x[:, :D] 2025-05-07T20:32:14.7102850Z x1 = x[:, D:] 2025-05-07T20:32:14.7102924Z 2025-05-07T20:32:14.7103005Z if contiguous: 2025-05-07T20:32:14.7103092Z x0 = x0.contiguous() 2025-05-07T20:32:14.7103187Z x1 = x1.contiguous() 2025-05-07T20:32:14.7103261Z 2025-05-07T20:32:14.7103356Z if scale_ub is not None: 2025-05-07T20:32:14.7103460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7103597Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7103674Z ) 2025-05-07T20:32:14.7103747Z else: 2025-05-07T20:32:14.7103844Z scale_ub_tensor = None 2025-05-07T20:32:14.7103919Z 2025-05-07T20:32:14.7104048Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7104138Z op = silu_mul_quant 2025-05-07T20:32:14.7104226Z if compiled: 2025-05-07T20:32:14.7104326Z op = torch.compile(op) 2025-05-07T20:32:14.7104429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7104506Z 2025-05-07T20:32:14.7104596Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7104601Z 2025-05-07T20:32:14.7104697Z moe/activation_test.py:117: 2025-05-07T20:32:14.7104826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7104925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7105111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7105500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.7105597Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.7106134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7106229Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7106610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7106839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7107199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7107299Z kernel = self.compile( 2025-05-07T20:32:14.7107705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7107889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7108022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7108027Z 2025-05-07T20:32:14.7108239Z self = 2025-05-07T20:32:14.7109088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7109634Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef4c68b0>} 2025-05-07T20:32:14.7110544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7110747Z context = 2025-05-07T20:32:14.7110752Z 2025-05-07T20:32:14.7111001Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7111281Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7111387Z module_map=module_map) 2025-05-07T20:32:14.7111552Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7111648Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7111726Z E ^ 2025-05-07T20:32:14.7112105Z E ValueError("type fp8e4nv not supported in this architecture. 
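Every CompilationError above bottoms out in the same place: Triton refuses to lower the fp8e4nv element type (PyTorch's float8_e4m3fn) on this GPU, offering only fp8e4b15 and fp8e5. That reads as an architecture limitation rather than a bad test input; fp8e4nv lowering is, to my understanding, only available from compute capability (8, 9) upward, and the 22.07 GiB device in the OOM messages below is consistent with an older sm_86-class part. A minimal sketch of a capability gate that would skip these examples instead of failing them; supports_fp8e4nv, the (8, 9) threshold, and Fp8SiluMulQuantTests are my assumptions, not FBGEMM code:

# A minimal sketch, not FBGEMM code: gate FP8 tests on device capability so
# unsupported GPUs skip instead of failing with a Triton CompilationError.
# Assumption: Triton's fp8e4nv (float8_e4m3fn) lowering requires compute
# capability >= (8, 9); the GPU in this log appears to be below that.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """Hypothetical helper: True only on GPUs where fp8e4nv can compile."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8SiluMulQuantTests(unittest.TestCase):
    """Hypothetical wrapper; the real suite lives in moe/activation_test.py."""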
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

[same @given/@settings decorators and test body as above]
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The same OutOfMemoryError (identical advice text, differing only in sizes) hits the test's setup code for:
  T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> 112.00 MiB at activation_test.py:95 (x_clamp)
  T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=False  -> 448.00 MiB at activation_test.py:92 (torch.randn)
  T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> 56.00 MiB at activation_test.py:95 (x_clamp)
  T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=False  -> 56.00 MiB at activation_test.py:94 (x_sign)
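The OutOfMemoryErrors look secondary: each example allocates fresh [T, 2*D] bf16 tensors (448 MiB alone for T=16384, D=7168), and once earlier examples have filled the caching allocator, later ones fail on requests as small as 56 MiB. The error text itself suggests a mitigation; below is a minimal sketch combining it with an explicit cache flush between examples. The setdefault call must run before torch initializes CUDA, and release_cuda_memory is a hypothetical helper name:

# A minimal sketch of the mitigation the error message itself suggests,
# plus an explicit cache flush between Hypothesis examples. Assumption:
# this module is imported before torch initializes CUDA (e.g. conftest.py).
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch


def release_cuda_memory() -> None:
    # Hypothetical helper to call from the test's setUp()/tearDown():
    # wait for pending kernels, then return cached blocks to the driver.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()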
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> moe/activation_test.py:115: in fn -> activation.py:80: in silu_mul_quant, ending in the same
E       triton.compiler.errors.CompilationError: at 1:0:
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Likewise for:
  T=128, D=5120, scale_ub=None, contiguous=True, compiled=False
  T=128, D=7168, scale_ub=None, contiguous=True, compiled=False
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7179747Z 2025-05-07T20:32:14.7180202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7180206Z 2025-05-07T20:32:14.7180309Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7180541Z self=, 2025-05-07T20:32:14.7180616Z T=2048, 2025-05-07T20:32:14.7180695Z D=7168, 2025-05-07T20:32:14.7180781Z scale_ub=1200.0, 2025-05-07T20:32:14.7180868Z contiguous=True, 2025-05-07T20:32:14.7180952Z compiled=False, 2025-05-07T20:32:14.7181029Z ) 2025-05-07T20:32:14.7181253Z self = 2025-05-07T20:32:14.7181430Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.7181435Z 2025-05-07T20:32:14.7181595Z @given( 2025-05-07T20:32:14.7181714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7181812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7181931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7182045Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7182159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7182232Z ) 2025-05-07T20:32:14.7182487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7182582Z def test_silu_mul_quant( 2025-05-07T20:32:14.7182656Z self, 2025-05-07T20:32:14.7182732Z T: int, 2025-05-07T20:32:14.7183147Z D: int, 2025-05-07T20:32:14.7183248Z scale_ub: Optional[float], 2025-05-07T20:32:14.7183341Z contiguous: bool, 2025-05-07T20:32:14.7183425Z compiled: bool, 2025-05-07T20:32:14.7183504Z ) -> None: 2025-05-07T20:32:14.7183599Z torch.manual_seed(2025) 2025-05-07T20:32:14.7183674Z 2025-05-07T20:32:14.7183845Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7185816Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.7185822Z 2025-05-07T20:32:14.7185938Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.7185943Z 2025-05-07T20:32:14.7186046Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7186275Z self=, 2025-05-07T20:32:14.7186353Z T=1, 2025-05-07T20:32:14.7186431Z D=5120, 2025-05-07T20:32:14.7186513Z scale_ub=1200.0, 2025-05-07T20:32:14.7186596Z contiguous=True, 2025-05-07T20:32:14.7186855Z compiled=False, 2025-05-07T20:32:14.7186937Z ) 2025-05-07T20:32:14.7187162Z self = 2025-05-07T20:32:14.7187329Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.7187334Z 2025-05-07T20:32:14.7187410Z @given( 2025-05-07T20:32:14.7187530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7187632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7191052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7191193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7191311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7191386Z ) 2025-05-07T20:32:14.7191655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7191754Z def test_silu_mul_quant( 2025-05-07T20:32:14.7191833Z self, 2025-05-07T20:32:14.7191914Z T: int, 2025-05-07T20:32:14.7191990Z D: int, 2025-05-07T20:32:14.7192088Z scale_ub: Optional[float], 2025-05-07T20:32:14.7192178Z contiguous: bool, 2025-05-07T20:32:14.7192266Z compiled: bool, 2025-05-07T20:32:14.7192347Z ) -> None: 2025-05-07T20:32:14.7192444Z torch.manual_seed(2025) 2025-05-07T20:32:14.7192521Z 2025-05-07T20:32:14.7192696Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7192771Z 2025-05-07T20:32:14.7192863Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7192988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7193081Z x = x_sign * x_clamp 2025-05-07T20:32:14.7193165Z x0 = x[:, :D] 2025-05-07T20:32:14.7193248Z x1 = x[:, D:] 2025-05-07T20:32:14.7193487Z 2025-05-07T20:32:14.7193570Z if contiguous: 2025-05-07T20:32:14.7193659Z x0 = x0.contiguous() 2025-05-07T20:32:14.7193751Z x1 = x1.contiguous() 2025-05-07T20:32:14.7193830Z 2025-05-07T20:32:14.7193922Z if scale_ub is not None: 2025-05-07T20:32:14.7194032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7194167Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7194250Z ) 2025-05-07T20:32:14.7194327Z else: 2025-05-07T20:32:14.7194423Z scale_ub_tensor = None 2025-05-07T20:32:14.7194502Z 2025-05-07T20:32:14.7194636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7194729Z op = silu_mul_quant 2025-05-07T20:32:14.7194820Z if compiled: 2025-05-07T20:32:14.7194919Z op = torch.compile(op) 2025-05-07T20:32:14.7195024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7195108Z 2025-05-07T20:32:14.7195198Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7195203Z 2025-05-07T20:32:14.7195301Z moe/activation_test.py:117: 2025-05-07T20:32:14.7195439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7195538Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7195644Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7196185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7196285Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7196668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7196905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7197263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7197364Z kernel = self.compile( 2025-05-07T20:32:14.7197774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7198037Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7198167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7198172Z 2025-05-07T20:32:14.7198380Z self = 2025-05-07T20:32:14.7199234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7199776Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1cef3089d0>} 2025-05-07T20:32:14.7200603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7200798Z context = 2025-05-07T20:32:14.7200803Z 2025-05-07T20:32:14.7200973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7201248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7201355Z module_map=module_map) 2025-05-07T20:32:14.7201522Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7201617Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7201693Z E ^ 2025-05-07T20:32:14.7202079Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.7202164Z 2025-05-07T20:32:14.7202609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.7202614Z 2025-05-07T20:32:14.7202725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.7202955Z self=, 2025-05-07T20:32:14.7203029Z T=2048, 2025-05-07T20:32:14.7203107Z D=5120, 2025-05-07T20:32:14.7203187Z scale_ub=None, 2025-05-07T20:32:14.7203271Z contiguous=True, 2025-05-07T20:32:14.7203358Z compiled=False, 2025-05-07T20:32:14.7203429Z ) 2025-05-07T20:32:14.7203655Z self = 2025-05-07T20:32:14.7203833Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.7203838Z 2025-05-07T20:32:14.7203914Z @given( 2025-05-07T20:32:14.7204037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7204138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7204254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7204373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7204488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7204562Z ) 2025-05-07T20:32:14.7204824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7204917Z def test_silu_mul_quant( 2025-05-07T20:32:14.7204990Z self, 2025-05-07T20:32:14.7205070Z T: int, 2025-05-07T20:32:14.7205145Z D: int, 2025-05-07T20:32:14.7205241Z scale_ub: Optional[float], 2025-05-07T20:32:14.7205333Z contiguous: bool, 2025-05-07T20:32:14.7205418Z compiled: bool, 2025-05-07T20:32:14.7205502Z ) -> None: 2025-05-07T20:32:14.7205595Z torch.manual_seed(2025) 2025-05-07T20:32:14.7205669Z 2025-05-07T20:32:14.7205842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7205920Z 2025-05-07T20:32:14.7206011Z > x_sign = torch.sign(x) 2025-05-07T20:32:14.7208055Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Hypothesis then tries further examples, and every large one fails with CUDA OOM before reaching the kernel. The first such failure in full:

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

[The next ten examples fail identically while allocating x (moe/activation_test.py:92), with the same allocator state as above; only the parameters and the requested size differ:]

Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False -> tried to allocate 320.00 MiB
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> tried to allocate 80.00 MiB
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> tried to allocate 112.00 MiB
Trying example: T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True  -> tried to allocate 448.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False -> tried to allocate 448.00 MiB
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> tried to allocate 448.00 MiB
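Note: each large example dies allocating the input tensor itself, and the failed sizes match the tensor shapes exactly: at T=16384, D=7168 the bf16 input x is 16384 * (2 * 7168) * 2 bytes = 448 MiB, the size reported above. The GPU is already at 22.03 GiB of 22.07 GiB before each attempt, so memory from earlier examples is not being returned between Hypothesis examples. The allocator's own suggestion can be applied via the environment, and flushing the cache between examples is a plausible mitigation (a sketch only, assuming it runs before CUDA is initialized; this is not what the test currently does):

    import os
    # Honoured only if set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_blocks() -> None:
        # Hypothetical per-example cleanup: returns cached blocks to the driver
        # so the next Hypothesis example starts from a smaller footprint.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()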
The first example small enough to allocate (T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) gets past the input setup and hits the same Triton failure at the kernel launch:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... same jit.py / compiler.py frames as the first CompilationError above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92 while allocating 56.00 MiB (30.44 MiB free; 21.74 GiB allocated by PyTorch)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

With compiled=True the call is routed through torch.compile before reaching the same kernel, and fails the same way:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... same jit.py / compiler.py frames as above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
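Note: the error repeats for every example that survives allocation, with or without torch.compile, because it is raised while Triton lowers the kernel's AST to TTIR, before anything is launched. A standalone sketch of the failing pattern (hypothetical kernel, not the FBGEMM source) that reproduces the same ValueError on a pre-SM-8.9 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_cast_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offs = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        # Casting to the fp8e4nv element type is what trips
        # ValueError("type fp8e4nv not supported in this architecture.") here.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_kernel[(1,)](x, y, x.numel(), BLOCK=1024)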
[Three further T=128 examples then fail with torch.OutOfMemoryError on 20.00 MiB allocations; free memory has now dropped to 8.44 MiB with 22.05 GiB in use:]

Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> OOM at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True  -> OOM at moe/activation_test.py:94 (x_sign = torch.sign(x))
Trying example: T=128, D=7168, scale_ub=None,   contiguous=True, compiled=True  -> OOM at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...))

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 (3 occurrences)
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "

experimental/gen_ai/test/moe/activation_test.py: 10 warnings
  /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
    torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 1 failed, 1 passed, 13 warnings in 33.02s ===================
ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error)

[TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py

[EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
============================= test session starts ==============================
platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
cachedir: .pytest_cache
hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
plugins: hypothesis-6.131.14
TMA benchmarks will be running with experimental grid constant TMA descriptor.
collecting ... collected 2 items / 1 deselected / 1 selected
run-last-failure: rerun previous 1 failure

W0507 20:32:22.740729 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[the "W0507 20:32:22.740729 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]" prefix on each line is elided below]
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
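Note: these W0507 lines come from torch.compile's handling of user-defined Triton kernels. To decide which kernel arguments are mutated, Dynamo compiles the kernel to TTIR on the side; since that compilation hits the same fp8e4nv ValueError, it falls back to assuming every input is mutated, which is correctness-preserving but pessimistic, and the warning itself is informational. One way to keep such a wrapper out of Dynamo's analysis entirely is to exclude it from compilation (a sketch with hypothetical placement; FBGEMM may handle this differently):

    import torch

    @torch.compiler.disable
    def silu_mul_quant_uncompiled(x0, x1, scale_ub_tensor):
        # Runs eagerly even inside a torch.compile'd region, so Dynamo never
        # tries to trace or analyze the Triton kernel launch it contains.
        ...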
[An identical identify_mutated_tensors warning and CompilationError traceback is emitted a second time at 20:32:22.758873 for the second kernel launch.]

moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4159714Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:23.4160044Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:23.4160898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:23.4161721Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:23.4162303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.4163035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.4163774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:23.4164546Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:23.4165356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:23.4166156Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:23.4166938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:23.4167624Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:23.4168261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:23.4168912Z fn() 2025-05-07T20:32:23.4169449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:23.4170071Z self.fn.run( 2025-05-07T20:32:23.4170551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.4171115Z kernel = self.compile( 2025-05-07T20:32:23.4171687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.4172374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.4172788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4173043Z 2025-05-07T20:32:23.4173256Z self = 2025-05-07T20:32:23.4174431Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.4176016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd80e7040>} 2025-05-07T20:32:23.4177489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.4178591Z context = 2025-05-07T20:32:23.4178902Z 2025-05-07T20:32:23.4179075Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.4179706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.4180201Z module_map=module_map) 2025-05-07T20:32:23.4180572Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.4180939Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:23.4181210Z E ^ 2025-05-07T20:32:23.4181693Z E ValueError("type fp8e4nv not supported in this architecture. 
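Every failure in this run reduces to the same root cause: the Triton kernels request the fp8e4nv (e4m3) element type, which Triton compiles only for NVIDIA GPUs of compute capability 8.9 or newer, while the A10G on this linux.g5.4xlarge runner is SM 8.6 (hence the hint that only 'fp8e4b15' and 'fp8e5' are available). A minimal sketch of a capability gate such a test module could apply; supports_fp8e4nv and Fp8e4nvTests are illustrative names, not part of the suite above:

# Hypothetical skip guard. Assumption: fp8e4nv Triton codegen requires
# compute capability >= 8.9 (Ada/Hopper); the A10G here reports (8, 6).
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """Return True when the visible GPU can compile fp8e4nv Triton kernels."""
    if not torch.cuda.is_available():
        return False
    # get_device_capability() returns a (major, minor) tuple.
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs SM 8.9+; skipping")
class Fp8e4nvTests(unittest.TestCase):
    def test_placeholder(self) -> None:
        self.assertTrue(supports_fp8e4nv())

With a gate like this the suite would report the fp8 cases as skipped on SM 8.6 runners instead of burning time on repeated CompilationErrors.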
Trying example: test_silu_mul_quant(
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
W0507 20:32:24.443299 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
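For reference while reading these listings, the fused op under test computes silu(x0) * x1 and then quantizes the product rowwise to fp8. A pure-eager sketch of that math follows; quantize_fp8_row_ref and silu_mul_quant_ref are illustrative stand-ins (not FBGEMM's API), and e5m2 is chosen only because this GPU supports it:

# Eager sketch of the computation the failing kernels fuse: SiLU-mul followed
# by rowwise fp8 quantization. The scale convention (dequant multiplies by
# scale) mirrors the test's `y_fp8.to(torch.float32) * y_scale[:, None]`.
from typing import Optional, Tuple

import torch


def quantize_fp8_row_ref(
    y: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_dtype: torch.dtype = torch.float8_e5m2,  # available pre-SM 8.9
) -> Tuple[torch.Tensor, torch.Tensor]:
    fp8_max = torch.finfo(fp8_dtype).max
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        # Clamp each row's dynamic range by the optional upper bound.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    scale = row_max / fp8_max
    scale = torch.where(scale > 0, scale, torch.ones_like(scale))  # guard all-zero rows
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(fp8_dtype)
    return y_fp8, scale


def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    x0_fp32, x1_fp32 = x0.to(torch.float32), x1.to(torch.float32)
    return quantize_fp8_row_ref(x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32, scale_ub)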
Trying example: test_silu_mul_quant(
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
W0507 20:32:26.182805 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
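A note on the warning above: the [0/0] through [0/3] prefixes are Dynamo (re)compile attempts, one per new (T, D) example, and "assuming every input is mutated" is torch.compile's safe fallback when it cannot lower the user Triton kernel to TTIR to analyze which arguments are written. If the op tolerates dynamic shapes (an assumption; the import path below is inferred from the traceback), marking the compile dynamic is one way to collapse the per-shape recompiles, sketched here:

# Sketch only: compile once with symbolic shapes instead of once per (T, D).
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# dynamic=True asks Dynamo to trace with symbolic sizes, so new token counts
# reuse the cached graph rather than advancing the [0/N] restart counter.
compiled_op = torch.compile(silu_mul_quant, dynamic=True)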
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
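Hypothesis keeps drawing fresh examples even though every draw fails identically. When triaging a log like this, it can help to pin one failing draw so it always runs first, deterministically. A small self-contained sketch using Hypothesis's @example decorator; test_pinned_example is illustrative, not part of this suite:

# Sketch: pin a known-failing draw ahead of Hypothesis's random search.
# The parameters mirror the first failing example in this log.
from hypothesis import example, given, settings
import hypothesis.strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=1, D=5120)  # always tried before any random draws
@settings(deadline=None)
def test_pinned_example(T: int, D: int) -> None:
    assert T > 0 and D > 0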
Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
W0507 20:32:28.176104 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None
2025-05-07T20:32:30.1614761Z 2025-05-07T20:32:30.1614993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1615327Z op = silu_mul_quant 2025-05-07T20:32:30.1615754Z if compiled: 2025-05-07T20:32:30.1616007Z op = torch.compile(op) 2025-05-07T20:32:30.1616315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1616611Z 2025-05-07T20:32:30.1616805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.1616984Z 2025-05-07T20:32:30.1617085Z moe/activation_test.py:117: 2025-05-07T20:32:30.1617396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1617745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.1618028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1618770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.1619517Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.1620083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.1620827Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.1621563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.1622137Z kernel = self.compile( 2025-05-07T20:32:30.1622714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.1623412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.1623834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1624077Z 2025-05-07T20:32:30.1624298Z self = 2025-05-07T20:32:30.1625470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.1627064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd6c120d0>} 2025-05-07T20:32:30.1628536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.1629742Z context = 2025-05-07T20:32:30.1630048Z 2025-05-07T20:32:30.1630225Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.1630771Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.1631274Z module_map=module_map) 2025-05-07T20:32:30.1631649Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.1632004Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.1632275Z E ^ 2025-05-07T20:32:30.1632765Z E ValueError("type fp8e4nv not supported in this architecture. 
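Every failure in this run reduces to the same root cause: Triton's fp8e4nv type (FP8 E4M3) is only available on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G behind a linux.g5.4xlarge runner reports SM 8.6, which is why the error lists only ('fp8e4b15', 'fp8e5') as supported. A minimal sketch of a capability gate for such tests, assuming a class-level skip is acceptable; the helper name, class name, and the (8, 9) threshold are illustrative assumptions, not code from moe/activation_test.py:

    # Hypothetical capability gate: skip FP8 E4M3 tests on GPUs older than
    # SM 8.9, where Triton rejects fp8e4nv at compile time (as in this log).
    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: hardware FP8 E4M3 arrived with SM 8.9 (Ada) / 9.0 (Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class Fp8MoEActivationTests(unittest.TestCase):
        ...

With a gate like this, the whole class would be reported as skipped once instead of Hypothesis replaying the same CompilationError for every drawn example.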
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
[... test body and _fbgemm_silu_mul_quant compile traceback identical to the T=4096, D=5120 example above; fails at moe/activation_test.py:117 with the same fp8e4nv ValueError ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
[... test body as above through fn() ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, ...)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f3dd6791700>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
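The ref_fn above computes the SiLU-mul in fp32 and delegates row-wise FP8 quantization to triton_quantize_fp8_row, which is itself a Triton kernel and therefore fails on this GPU for the same reason as the op under test. A plain-PyTorch sketch of that quantization step, consistent with the dequantization y_fp8.to(torch.float32) * y_scale[:, None] used in the test; the FP8_E4M3_MAX constant and the eps clamp are assumptions, not fbgemm_gpu internals:

    # Row-wise FP8 quantization in plain PyTorch (no Triton compile needed).
    # Assumed details: 448.0 is the max finite float8_e4m3fn value; the clamp
    # avoids division by zero on all-zero rows.
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
        y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale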
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
[... test body and traceback identical to the T=4096, D=5120 example above; _fbgemm_silu_mul_quant fails to compile at moe/activation_test.py:117 with the same fp8e4nv ValueError ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
[... test body and traceback identical; _fbgemm_silu_mul_quant fails to compile with the same fp8e4nv ValueError ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:32:31.177882 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... identical identify_mutated_tensors traceback as above, logged twice (20:32:31.177882 and 20:32:31.367869); _fbgemm_silu_mul_quant fails TTIR generation with the same fp8e4nv ValueError ...]

self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test body as above through fn() and ref_fn() ...]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[... triton_quantize_fp8_row -> _kernel_quantize_fp8_row autotune/compile traceback identical to the T=128, D=7168, compiled=True example above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
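Each "Trying example: test_silu_mul_quant(...)" block is Hypothesis's Verbosity.verbose echo of the arguments it drew via st.sampled_from. When debugging a single draw such as the T=1, D=5120 case above, one illustrative option (not present in moe/activation_test.py) is to pin it with hypothesis's @example decorator, which runs the given arguments unconditionally before any random examples:

    # Sketch: pin one failing draw from this log so it replays on every run.
    # The reduced signature is illustrative; the real test takes more knobs.
    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=5120)
    @settings(deadline=None)
    def test_silu_mul_quant_sketch(T: int, D: int) -> None:
        ...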
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:32:32.347668 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... identical identify_mutated_tensors traceback as above, logged twice (20:32:32.347668 and 20:32:32.535313); _fbgemm_silu_mul_quant fails TTIR generation with the same fp8e4nv ValueError ...]

self = <...>
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test body as above through fn() and ref_fn(); ref_fn() fails at moe/activation_test.py:126 via triton_quantize_fp8_row -> _kernel_quantize_fp8_row ...]

self = <...>
options = CUDAOptions(num_warps=4,
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.0447030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd5cd69d0>} 2025-05-07T20:32:33.0448501Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.0449607Z context = 2025-05-07T20:32:33.0449910Z 2025-05-07T20:32:33.0450087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.0450626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.0451197Z module_map=module_map) 2025-05-07T20:32:33.0451569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.0451939Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:33.0452202Z E ^ 2025-05-07T20:32:33.0452688Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.0453175Z 2025-05-07T20:32:33.0453626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.0454202Z 2025-05-07T20:32:33.0454312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.0454755Z self=, 2025-05-07T20:32:33.0455170Z T=128, 2025-05-07T20:32:33.0455354Z D=5120, 2025-05-07T20:32:33.0455539Z scale_ub=None, 2025-05-07T20:32:33.0455758Z contiguous=True, 2025-05-07T20:32:33.0455985Z compiled=True, 2025-05-07T20:32:33.0456178Z ) 2025-05-07T20:32:33.5743297Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:33.5744966Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:33.5746446Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:33.5748022Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:33.5749537Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:33.5751398Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.5752837Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:33.5754354Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.5755915Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
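Every failing example above bottoms out in the same ValueError: Triton's fp8e4nv type (FP8 E4M3) is not lowered for this runner's GPU, an NVIDIA A10G reporting compute capability (8, 6), while fp8e4nv generally requires sm_89 or newer. A minimal sketch of a capability gate such a test could use follows; the (8, 9) threshold and the skip decorator are illustrative assumptions, not FBGEMM's actual guard.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (e4m3) only on compute capability
    # >= (8, 9) (Ada/Hopper). The A10G here reports (8, 6), which matches
    # the CompilationErrors in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical gate; the real test module may guard differently.
@unittest.skipUnless(supports_fp8e4nv(), "FP8 e4m3 requires sm_89 or newer")
class SiluMulQuantFP8Test(unittest.TestCase):
    pass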
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:33.5757301Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:33.5758635Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:33.5759961Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:33.5761094Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:33.5762209Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:33.5763720Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:33.5765125Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:33.5766340Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:33.5767471Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:33.5768764Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:33.5770258Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:33.5771415Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.5772401Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.5773194Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:33.5774297Z W0507 20:32:33.570105 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.7628846Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:33.7630266Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:33.7636707Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:33.7638312Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:33.7639844Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:33.7641393Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.7642837Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:33.7644398Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.7645994Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:33.7647552Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:33.7648902Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:33.7650233Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:33.7651382Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:33.7652510Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:33.7653865Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:33.7655280Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:33.7656493Z W0507 
20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:33.7657634Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:33.7658940Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:33.7660439Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:33.7661679Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.7662659Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.7663458Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:33.7664572Z W0507 20:32:33.758944 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.5817350Z self = 2025-05-07T20:32:34.5818143Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:34.5818536Z 2025-05-07T20:32:34.5818645Z @given( 2025-05-07T20:32:34.5818974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.5819396Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.5819790Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.5820214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.5820585Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.5820882Z ) 2025-05-07T20:32:34.5821252Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.5821724Z def test_silu_mul_quant( 2025-05-07T20:32:34.5821968Z self, 2025-05-07T20:32:34.5822157Z T: int, 2025-05-07T20:32:34.5822349Z D: int, 2025-05-07T20:32:34.5822568Z scale_ub: Optional[float], 2025-05-07T20:32:34.5822847Z contiguous: bool, 2025-05-07T20:32:34.5823298Z compiled: bool, 2025-05-07T20:32:34.5823523Z ) -> None: 2025-05-07T20:32:34.5823745Z torch.manual_seed(2025) 2025-05-07T20:32:34.5823996Z 2025-05-07T20:32:34.5824276Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.5824641Z 2025-05-07T20:32:34.5824838Z x_sign = torch.sign(x) 2025-05-07T20:32:34.5825135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.5825467Z x = x_sign * x_clamp 2025-05-07T20:32:34.5825719Z x0 = x[:, :D] 2025-05-07T20:32:34.5825939Z x1 = x[:, D:] 2025-05-07T20:32:34.5826152Z 2025-05-07T20:32:34.5826341Z if contiguous: 2025-05-07T20:32:34.5826574Z x0 = x0.contiguous() 2025-05-07T20:32:34.5826843Z x1 = x1.contiguous() 2025-05-07T20:32:34.5827095Z 2025-05-07T20:32:34.5827288Z if scale_ub is not None: 2025-05-07T20:32:34.5827576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.5827933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.5828246Z ) 2025-05-07T20:32:34.5828438Z else: 2025-05-07T20:32:34.5828646Z scale_ub_tensor = None 
2025-05-07T20:32:34.5828902Z 2025-05-07T20:32:34.5829128Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.5829455Z op = silu_mul_quant 2025-05-07T20:32:34.5829829Z if compiled: 2025-05-07T20:32:34.5830079Z op = torch.compile(op) 2025-05-07T20:32:34.5830383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5830663Z 2025-05-07T20:32:34.5830849Z y_fp8, y_scale = fn() 2025-05-07T20:32:34.5831129Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:34.5831425Z 2025-05-07T20:32:34.5831654Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.5831998Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:34.5832303Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:34.5832622Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:34.5832992Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:34.5833438Z 2025-05-07T20:32:34.5833644Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:34.5833843Z 2025-05-07T20:32:34.5833942Z moe/activation_test.py:126: 2025-05-07T20:32:34.5834247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5834595Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:34.5834924Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:34.5835773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:34.5836589Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:34.5837171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.5837906Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.5838651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:34.5839420Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:34.5840227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:34.5841018Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:34.5841795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:34.5842473Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:34.5843102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:34.5843738Z fn() 2025-05-07T20:32:34.5844266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:34.5844936Z self.fn.run( 2025-05-07T20:32:34.5845413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.5845971Z kernel = self.compile( 2025-05-07T20:32:34.5846539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.5847224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.5847633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5847881Z 2025-05-07T20:32:34.5848094Z self = 2025-05-07T20:32:34.5849258Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.5850785Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd5944d30>} 2025-05-07T20:32:34.5852246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.5853342Z context = 2025-05-07T20:32:34.5853646Z 2025-05-07T20:32:34.5853817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.5854361Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.5854851Z module_map=module_map) 2025-05-07T20:32:34.5855225Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.5855590Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:34.5855941Z E ^ 2025-05-07T20:32:34.5856435Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.5856925Z 2025-05-07T20:32:34.5857381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.5857932Z 2025-05-07T20:32:34.5858040Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.5858454Z self=, 2025-05-07T20:32:34.5858871Z T=4096, 2025-05-07T20:32:34.5859054Z D=5120, 2025-05-07T20:32:34.5859242Z scale_ub=None, 2025-05-07T20:32:34.5859450Z contiguous=True, 2025-05-07T20:32:34.5859671Z compiled=True, 2025-05-07T20:32:34.5859870Z ) 2025-05-07T20:32:35.1192014Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:35.1193479Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:35.1194961Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:35.1196540Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:35.1198055Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:35.1199761Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.1201197Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.1202715Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.1204277Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:35.1205713Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:35.1207055Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:35.1208390Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:35.1209531Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:35.1210649Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:35.1212105Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:35.1213522Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:35.1214740Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:35.1215879Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:35.1217171Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:35.1218675Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:35.1219828Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.1220815Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.1221612Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:35.1222718Z W0507 20:32:35.115098 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3110236Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:35.3111849Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:35.3113320Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:35.3114936Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:35.3116457Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:35.3117993Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3119432Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.3120943Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3122504Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:35.3123880Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:35.3125370Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:35.3126700Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:35.3127833Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:35.3128954Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:35.3130301Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:35.3131725Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:35.3132939Z W0507 
20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:35.3134077Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:35.3135374Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:35.3136955Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:35.3138112Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3139099Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3139894Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:35.3141003Z W0507 20:32:35.306991 87499 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.9742627Z self = 2025-05-07T20:32:35.9744042Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.9744744Z 2025-05-07T20:32:35.9744944Z @given( 2025-05-07T20:32:35.9745383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.9745744Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.9746061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.9746402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.9746742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.9747064Z ) 2025-05-07T20:32:35.9747425Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.9747888Z def test_silu_mul_quant( 2025-05-07T20:32:35.9748137Z self, 2025-05-07T20:32:35.9748335Z T: int, 2025-05-07T20:32:35.9748526Z D: int, 2025-05-07T20:32:35.9748751Z scale_ub: Optional[float], 2025-05-07T20:32:35.9749026Z contiguous: bool, 2025-05-07T20:32:35.9749269Z compiled: bool, 2025-05-07T20:32:35.9749500Z ) -> None: 2025-05-07T20:32:35.9749860Z torch.manual_seed(2025) 2025-05-07T20:32:35.9750106Z 2025-05-07T20:32:35.9750565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.9750928Z 2025-05-07T20:32:35.9751112Z x_sign = torch.sign(x) 2025-05-07T20:32:35.9751410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.9751728Z x = x_sign * x_clamp 2025-05-07T20:32:35.9751971Z x0 = x[:, :D] 2025-05-07T20:32:35.9752183Z x1 = x[:, D:] 2025-05-07T20:32:35.9752386Z 2025-05-07T20:32:35.9752571Z if contiguous: 2025-05-07T20:32:35.9752798Z x0 = x0.contiguous() 2025-05-07T20:32:35.9753058Z x1 = x1.contiguous() 2025-05-07T20:32:35.9753297Z 2025-05-07T20:32:35.9753481Z if scale_ub is not None: 2025-05-07T20:32:35.9753757Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.9754107Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.9754414Z ) 2025-05-07T20:32:35.9754606Z else: 2025-05-07T20:32:35.9754815Z scale_ub_tensor = None 
2025-05-07T20:32:35.9755071Z 2025-05-07T20:32:35.9755302Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.9755619Z op = silu_mul_quant 2025-05-07T20:32:35.9755863Z if compiled: 2025-05-07T20:32:35.9756110Z op = torch.compile(op) 2025-05-07T20:32:35.9756411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.9756700Z 2025-05-07T20:32:35.9756884Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.9757169Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.9757467Z 2025-05-07T20:32:35.9757697Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.9758038Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.9758333Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.9758780Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.9759150Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.9759474Z 2025-05-07T20:32:35.9759674Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.9759882Z 2025-05-07T20:32:35.9759980Z moe/activation_test.py:126: 2025-05-07T20:32:35.9760284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.9760630Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.9760957Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.9761803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.9762615Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.9763182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.9763917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.9764656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.9765429Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.9766233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.9767032Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.9767807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.9768488Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.9769118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.9769671Z fn() 2025-05-07T20:32:35.9770203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.9770898Z self.fn.run( 2025-05-07T20:32:35.9771387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.9771949Z kernel = self.compile( 2025-05-07T20:32:35.9772516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.9773208Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.9773617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.9773858Z 2025-05-07T20:32:35.9774073Z self = 2025-05-07T20:32:35.9775238Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.9776755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd5944670>} 2025-05-07T20:32:35.9778223Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.9779328Z context = 2025-05-07T20:32:35.9779631Z 2025-05-07T20:32:35.9779802Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.9780339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.9780830Z module_map=module_map) 2025-05-07T20:32:35.9781284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.9781648Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.9781911Z E ^ 2025-05-07T20:32:35.9782404Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.9783076Z 2025-05-07T20:32:35.9783530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.9784084Z 2025-05-07T20:32:35.9784184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.9784607Z self=, 2025-05-07T20:32:35.9785054Z T=16384, 2025-05-07T20:32:35.9785268Z D=5120, 2025-05-07T20:32:35.9785454Z scale_ub=None, 2025-05-07T20:32:35.9785661Z contiguous=True, 2025-05-07T20:32:35.9785883Z compiled=True, 2025-05-07T20:32:35.9786079Z ) 2025-05-07T20:32:36.0205788Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:36.0207387Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:36.0208850Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:36.0209920Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:36.0211114Z W0507 20:32:36.019138 87499 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
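The [0/8] warning is a separate issue from the fp8 failures: each Hypothesis example changes T, and toggling contiguous changes the strides of the x0/x1 slices, so torch.compile's guards force a fresh graph per example until the default recompile_limit of 8 is exhausted and Dynamo falls back to eager. A hedged sketch of the usual mitigations follows; the import path for silu_mul_quant is inferred from the traceback above and may differ.

import torch

# Path inferred from the log (.../fbgemm_gpu/experimental/gen_ai/moe/activation.py).
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Option 1: raise the recompile budget (the log shows the default of 8).
torch._dynamo.config.recompile_limit = 64

# Option 2: compile with dynamic shapes so a changing T does not trigger a
# new graph for every example.
op = torch.compile(silu_mul_quant, dynamic=True)

# Option 3: mark only the varying batch dimension as dynamic.
x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
torch._dynamo.mark_dynamic(x0, 0)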
2025-05-07T20:32:36.1427696Z self = 
2025-05-07T20:32:36.1428489Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:36.1428881Z 
2025-05-07T20:32:36.1443379Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.1443587Z 
2025-05-07T20:32:36.1443684Z moe/activation_test.py:126: 
2025-05-07T20:32:36.1470478Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.1470863Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.1471141Z E       ^
2025-05-07T20:32:36.1471636Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.1472136Z 
2025-05-07T20:32:36.1472586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.1473147Z 
2025-05-07T20:32:36.1473251Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:36.1473683Z     self=,
2025-05-07T20:32:36.1474101Z     T=1,
2025-05-07T20:32:36.1474283Z     D=5120,
2025-05-07T20:32:36.1474482Z     scale_ub=1200.0,
2025-05-07T20:32:36.1474704Z     contiguous=True,
2025-05-07T20:32:36.1474936Z     compiled=True,
2025-05-07T20:32:36.1475174Z )
2025-05-07T20:32:36.3186048Z self = 
2025-05-07T20:32:36.3187024Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:36.3187360Z 
2025-05-07T20:32:36.3199009Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:36.3199176Z 
2025-05-07T20:32:36.3199276Z moe/activation_test.py:117: 
2025-05-07T20:32:36.3199582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:36.3199935Z moe/activation_test.py:115: in fn
2025-05-07T20:32:36.3200218Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.3200815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:36.3201416Z     return fn(*args, **kwargs)
2025-05-07T20:32:36.3202124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:36.3202863Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:36.3203433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:36.3204172Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:36.3204881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:36.3205451Z     kernel = self.compile(
2025-05-07T20:32:36.3206022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:36.3206805Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.3207212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:36.3207457Z 
2025-05-07T20:32:36.3207668Z self = 
2025-05-07T20:32:36.3208838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:36.3210340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd55e9ca0>}
2025-05-07T20:32:36.3211818Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:36.3212922Z context = 
2025-05-07T20:32:36.3213232Z 
2025-05-07T20:32:36.3213398Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:36.3213945Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.3214435Z                            module_map=module_map)
2025-05-07T20:32:36.3214803Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.3215162Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:36.3215423Z E       ^
2025-05-07T20:32:36.3215907Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.3216481Z 
2025-05-07T20:32:36.3216926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
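For orientation: ref_fn in the test computes SiLU(x0) * x1 in fp32 and then calls triton_quantize_fp8_row, which is itself a Triton kernel (_kernel_quantize_fp8_row) and so fails on this GPU for the same reason as the fused kernel. Below is a pure-PyTorch sketch of row-wise FP8 quantization with the same output contract the test checks (y is approximately y_fp8.float() * scale[:, None]); the 448.0 max and the scale_ub handling are assumptions about the kernel's semantics, not FBGEMM's implementation.

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn


def rowwise_fp8_quant_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Illustrative stand-in for triton_quantize_fp8_row, runnable in eager
    # mode on any device that supports float8 storage casts.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # Assumption: scale_ub caps the per-row max used to derive the scale.
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX  # per-row dequant scale
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale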
2025-05-07T20:32:36.3217485Z 
2025-05-07T20:32:36.3217591Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:36.3218012Z     self=,
2025-05-07T20:32:36.3218423Z     T=1,
2025-05-07T20:32:36.3218604Z     D=5120,
2025-05-07T20:32:36.3218792Z     scale_ub=None,
2025-05-07T20:32:36.3218996Z     contiguous=False,
2025-05-07T20:32:36.3219216Z     compiled=True,
2025-05-07T20:32:36.3219415Z )
2025-05-07T20:32:36.4028334Z self = 
2025-05-07T20:32:36.4029077Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:36.4029459Z 
2025-05-07T20:32:36.4043998Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.4044324Z 
2025-05-07T20:32:36.4044428Z moe/activation_test.py:126: 
2025-05-07T20:32:36.4065728Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.4066087Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.4066358Z E       ^
2025-05-07T20:32:36.4066840Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.4067440Z 
2025-05-07T20:32:36.4067890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.4068445Z 
2025-05-07T20:32:36.4068560Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:36.4068978Z     self=,
2025-05-07T20:32:36.4069396Z     T=1,
2025-05-07T20:32:36.4069572Z     D=5120,
2025-05-07T20:32:36.4069823Z     scale_ub=None,
2025-05-07T20:32:36.4070029Z     contiguous=True,
2025-05-07T20:32:36.4070253Z     compiled=False,
2025-05-07T20:32:36.4070450Z )
2025-05-07T20:32:36.7793534Z self = 
2025-05-07T20:32:36.7794336Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:36.7794714Z 
2025-05-07T20:32:36.7794828Z     @given(
2025-05-07T20:32:36.7795147Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:36.7795511Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:36.7795821Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:36.7796163Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:36.7796492Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:36.7796781Z     )
2025-05-07T20:32:36.7797138Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:36.7797599Z     def test_silu_mul_quant(
2025-05-07T20:32:36.7797841Z         self,
2025-05-07T20:32:36.7798030Z         T: int,
2025-05-07T20:32:36.7798218Z         D: int,
2025-05-07T20:32:36.7798435Z         scale_ub: Optional[float],
2025-05-07T20:32:36.7798705Z         contiguous: bool,
2025-05-07T20:32:36.7798938Z         compiled: bool,
2025-05-07T20:32:36.7799159Z     ) -> None:
2025-05-07T20:32:36.7799367Z         torch.manual_seed(2025)
2025-05-07T20:32:36.7799603Z 
2025-05-07T20:32:36.7799877Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:36.7800230Z 
2025-05-07T20:32:36.7800411Z         x_sign = torch.sign(x)
2025-05-07T20:32:36.7800873Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:36.7801192Z         x = x_sign * x_clamp
2025-05-07T20:32:36.7801431Z         x0 = x[:, :D]
2025-05-07T20:32:36.7801642Z         x1 = x[:, D:]
2025-05-07T20:32:36.7801847Z 
2025-05-07T20:32:36.7802029Z         if contiguous:
2025-05-07T20:32:36.7802252Z             x0 = x0.contiguous()
2025-05-07T20:32:36.7802512Z             x1 = x1.contiguous()
2025-05-07T20:32:36.7802757Z 
2025-05-07T20:32:36.7802945Z         if scale_ub is not None:
2025-05-07T20:32:36.7803218Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:36.7803560Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:36.7803870Z             )
2025-05-07T20:32:36.7804059Z         else:
2025-05-07T20:32:36.7804270Z             scale_ub_tensor = None
2025-05-07T20:32:36.7804518Z 
2025-05-07T20:32:36.7804743Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:36.7805065Z             op = silu_mul_quant
2025-05-07T20:32:36.7805317Z             if compiled:
2025-05-07T20:32:36.7805562Z                 op
= torch.compile(op) 2025-05-07T20:32:36.7805861Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7806144Z 2025-05-07T20:32:36.7806328Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7806496Z 2025-05-07T20:32:36.7806592Z moe/activation_test.py:117: 2025-05-07T20:32:36.7806888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7807229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7807508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7808246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7809112Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7809670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7810402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7811109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7811670Z kernel = self.compile( 2025-05-07T20:32:36.7812239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7812938Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7813348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7813586Z 2025-05-07T20:32:36.7813796Z self = 2025-05-07T20:32:36.7814967Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7816477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd54c78b0>} 2025-05-07T20:32:36.7817947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7819052Z context = 2025-05-07T20:32:36.7819356Z 2025-05-07T20:32:36.7819523Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7820065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7820556Z module_map=module_map) 2025-05-07T20:32:36.7820924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7821279Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7821618Z E ^ 2025-05-07T20:32:36.7822104Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7822596Z 2025-05-07T20:32:36.7823041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.7823599Z 2025-05-07T20:32:36.7823699Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7824128Z self=, 2025-05-07T20:32:36.7824543Z T=128, 2025-05-07T20:32:36.7824728Z D=5120, 2025-05-07T20:32:36.7824919Z scale_ub=None, 2025-05-07T20:32:36.7825129Z contiguous=False, 2025-05-07T20:32:36.7825360Z compiled=True, 2025-05-07T20:32:36.7825573Z ) 2025-05-07T20:32:36.7825895Z self = 2025-05-07T20:32:36.7826420Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:36.7826728Z 2025-05-07T20:32:36.7826803Z @given( 2025-05-07T20:32:36.7827032Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7827359Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7827666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7828008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7828343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7828630Z ) 2025-05-07T20:32:36.7828989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7829451Z def test_silu_mul_quant( 2025-05-07T20:32:36.7829802Z self, 2025-05-07T20:32:36.7829998Z T: int, 2025-05-07T20:32:36.7830278Z D: int, 2025-05-07T20:32:36.7830484Z scale_ub: Optional[float], 2025-05-07T20:32:36.7830755Z contiguous: bool, 2025-05-07T20:32:36.7830994Z compiled: bool, 2025-05-07T20:32:36.7831220Z ) -> None: 2025-05-07T20:32:36.7831426Z torch.manual_seed(2025) 2025-05-07T20:32:36.7831670Z 2025-05-07T20:32:36.7831943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7832292Z 2025-05-07T20:32:36.7832480Z x_sign = torch.sign(x) 2025-05-07T20:32:36.7832770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.7833080Z x = x_sign * x_clamp 2025-05-07T20:32:36.7833325Z x0 = x[:, :D] 2025-05-07T20:32:36.7833537Z x1 = x[:, D:] 2025-05-07T20:32:36.7833735Z 2025-05-07T20:32:36.7833914Z if contiguous: 2025-05-07T20:32:36.7834141Z x0 = x0.contiguous() 2025-05-07T20:32:36.7834393Z x1 = x1.contiguous() 2025-05-07T20:32:36.7834631Z 2025-05-07T20:32:36.7834819Z if scale_ub is not None: 2025-05-07T20:32:36.7835087Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.7835425Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.7835742Z ) 2025-05-07T20:32:36.7835928Z else: 2025-05-07T20:32:36.7836126Z scale_ub_tensor = None 2025-05-07T20:32:36.7836379Z 2025-05-07T20:32:36.7836605Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.7836926Z op = silu_mul_quant 2025-05-07T20:32:36.7837173Z if compiled: 2025-05-07T20:32:36.7837414Z op = torch.compile(op) 2025-05-07T20:32:36.7837708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7837987Z 2025-05-07T20:32:36.7838175Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7838339Z 2025-05-07T20:32:36.7838434Z moe/activation_test.py:117: 2025-05-07T20:32:36.7838731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7839080Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7839364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7840026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:36.7840619Z return fn(*args, **kwargs) 
2025-05-07T20:32:36.7841318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7842048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7842609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7843334Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7844036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7844594Z kernel = self.compile( 2025-05-07T20:32:36.7845164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7845861Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7846269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7846513Z 2025-05-07T20:32:36.7846724Z self = 2025-05-07T20:32:36.7847885Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7849383Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd510b5e0>} 2025-05-07T20:32:36.7850842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7852024Z context = 2025-05-07T20:32:36.7852329Z 2025-05-07T20:32:36.7852496Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7853041Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7853529Z module_map=module_map) 2025-05-07T20:32:36.7853893Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7854252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7854513Z E ^ 2025-05-07T20:32:36.7855001Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7855545Z 2025-05-07T20:32:36.7855990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.7856554Z 2025-05-07T20:32:36.7856657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7857083Z self=, 2025-05-07T20:32:36.7857493Z T=128, 2025-05-07T20:32:36.7857672Z D=7168, 2025-05-07T20:32:36.7857863Z scale_ub=1200.0, 2025-05-07T20:32:36.7858085Z contiguous=False, 2025-05-07T20:32:36.7858318Z compiled=False, 2025-05-07T20:32:36.7858526Z ) 2025-05-07T20:32:36.9403018Z self = 2025-05-07T20:32:36.9403878Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:36.9404288Z 2025-05-07T20:32:36.9404411Z @given( 2025-05-07T20:32:36.9413680Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.9414346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.9414992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.9415402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.9415738Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.9416220Z ) 2025-05-07T20:32:36.9416589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.9417070Z def test_silu_mul_quant( 2025-05-07T20:32:36.9417310Z self, 2025-05-07T20:32:36.9417506Z T: int, 2025-05-07T20:32:36.9417705Z D: int, 2025-05-07T20:32:36.9417916Z scale_ub: Optional[float], 2025-05-07T20:32:36.9418199Z contiguous: bool, 2025-05-07T20:32:36.9418444Z compiled: bool, 2025-05-07T20:32:36.9418660Z ) -> None: 2025-05-07T20:32:36.9418884Z torch.manual_seed(2025) 2025-05-07T20:32:36.9419125Z 2025-05-07T20:32:36.9419402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.9419764Z 2025-05-07T20:32:36.9419960Z x_sign = torch.sign(x) 2025-05-07T20:32:36.9420250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.9420563Z x = x_sign * x_clamp 2025-05-07T20:32:36.9420812Z x0 = x[:, :D] 2025-05-07T20:32:36.9421026Z x1 = x[:, D:] 2025-05-07T20:32:36.9421228Z 2025-05-07T20:32:36.9421416Z if contiguous: 2025-05-07T20:32:36.9421643Z x0 = x0.contiguous() 2025-05-07T20:32:36.9421897Z x1 = x1.contiguous() 2025-05-07T20:32:36.9422137Z 2025-05-07T20:32:36.9422328Z if scale_ub is not None: 2025-05-07T20:32:36.9422600Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.9422948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.9423264Z ) 2025-05-07T20:32:36.9423448Z else: 2025-05-07T20:32:36.9423657Z scale_ub_tensor = None 2025-05-07T20:32:36.9423916Z 2025-05-07T20:32:36.9424139Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.9424595Z op = silu_mul_quant 2025-05-07T20:32:36.9424850Z if compiled: 2025-05-07T20:32:36.9425091Z op = torch.compile(op) 2025-05-07T20:32:36.9425398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.9425682Z 2025-05-07T20:32:36.9425865Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.9426037Z 2025-05-07T20:32:36.9426134Z moe/activation_test.py:117: 2025-05-07T20:32:36.9426436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.9426777Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.9427056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.9427797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.9428539Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.9429098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.9429928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.9430642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.9431207Z kernel = self.compile( 2025-05-07T20:32:36.9431769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.9432461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.9432867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.9433107Z 2025-05-07T20:32:36.9433320Z self = 2025-05-07T20:32:36.9434494Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.9436093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4add430>} 2025-05-07T20:32:36.9437563Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.9438667Z context = 2025-05-07T20:32:36.9438972Z 2025-05-07T20:32:36.9439141Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.9439687Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.9440175Z module_map=module_map) 2025-05-07T20:32:36.9440545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.9440903Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.9441170Z E ^ 2025-05-07T20:32:36.9441667Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.9442156Z 2025-05-07T20:32:36.9442603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.9443161Z 2025-05-07T20:32:36.9443260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.9443682Z self=, 2025-05-07T20:32:36.9444093Z T=128, 2025-05-07T20:32:36.9444268Z D=5120, 2025-05-07T20:32:36.9444454Z scale_ub=None, 2025-05-07T20:32:36.9444669Z contiguous=False, 2025-05-07T20:32:36.9444889Z compiled=False, 2025-05-07T20:32:36.9445092Z ) 2025-05-07T20:32:36.9445415Z self = 2025-05-07T20:32:36.9446015Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.9446304Z 2025-05-07T20:32:36.9446379Z @given( 2025-05-07T20:32:36.9446613Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.9446924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.9447234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.9447568Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.9447904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.9448187Z ) 2025-05-07T20:32:36.9448542Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.9449002Z def test_silu_mul_quant( 2025-05-07T20:32:36.9449240Z self, 2025-05-07T20:32:36.9449430Z T: int, 2025-05-07T20:32:36.9449627Z D: int, 2025-05-07T20:32:36.9449840Z scale_ub: Optional[float], 2025-05-07T20:32:36.9450113Z contiguous: bool, 2025-05-07T20:32:36.9450361Z compiled: bool, 2025-05-07T20:32:36.9450576Z ) -> None: 2025-05-07T20:32:36.9450786Z torch.manual_seed(2025) 2025-05-07T20:32:36.9451027Z 2025-05-07T20:32:36.9451300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.9451658Z 2025-05-07T20:32:36.9451851Z x_sign = torch.sign(x) 2025-05-07T20:32:36.9452144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.9452455Z x = x_sign * x_clamp 2025-05-07T20:32:36.9452692Z x0 = x[:, :D] 2025-05-07T20:32:36.9452906Z x1 = x[:, D:] 2025-05-07T20:32:36.9453108Z 2025-05-07T20:32:36.9453292Z if contiguous: 2025-05-07T20:32:36.9453528Z x0 = x0.contiguous() 2025-05-07T20:32:36.9453781Z x1 = x1.contiguous() 2025-05-07T20:32:36.9454019Z 2025-05-07T20:32:36.9454209Z if scale_ub is not None: 2025-05-07T20:32:36.9454475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.9454821Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.9455139Z ) 2025-05-07T20:32:36.9455348Z else: 2025-05-07T20:32:36.9455578Z scale_ub_tensor = None 2025-05-07T20:32:36.9455913Z 2025-05-07T20:32:36.9456139Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.9456461Z op = silu_mul_quant 2025-05-07T20:32:36.9456711Z if compiled: 2025-05-07T20:32:36.9456958Z op = torch.compile(op) 2025-05-07T20:32:36.9457252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.9457528Z 2025-05-07T20:32:36.9457717Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.9457881Z 2025-05-07T20:32:36.9457976Z moe/activation_test.py:117: 2025-05-07T20:32:36.9458272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.9458614Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.9458897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.9459636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.9460381Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.9460942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.9461663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.9462367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.9467697Z kernel = self.compile( 2025-05-07T20:32:36.9468271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.9468968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.9469378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.9469799Z 2025-05-07T20:32:36.9470017Z self = 2025-05-07T20:32:36.9471187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.9472682Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4addca0>} 2025-05-07T20:32:36.9474151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.9475273Z context = 2025-05-07T20:32:36.9475575Z 2025-05-07T20:32:36.9475744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.9476296Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.9476784Z module_map=module_map) 2025-05-07T20:32:36.9477150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.9477505Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.9477764Z E ^ 2025-05-07T20:32:36.9478255Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.9478742Z 2025-05-07T20:32:36.9479189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.9479745Z 2025-05-07T20:32:36.9479846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.9480264Z self=, 2025-05-07T20:32:36.9480687Z T=128, 2025-05-07T20:32:36.9480872Z D=5120, 2025-05-07T20:32:36.9481059Z scale_ub=1200.0, 2025-05-07T20:32:36.9481281Z contiguous=True, 2025-05-07T20:32:36.9481498Z compiled=False, 2025-05-07T20:32:36.9481697Z ) 2025-05-07T20:32:37.1757665Z self = 2025-05-07T20:32:37.1758430Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.1758830Z 2025-05-07T20:32:37.1758940Z @given( 2025-05-07T20:32:37.1759251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1759652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1759972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1760310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1760650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1760947Z ) 2025-05-07T20:32:37.1761307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1761783Z def test_silu_mul_quant( 2025-05-07T20:32:37.1762030Z self, 2025-05-07T20:32:37.1762227Z T: int, 2025-05-07T20:32:37.1762433Z D: int, 2025-05-07T20:32:37.1762663Z scale_ub: Optional[float], 2025-05-07T20:32:37.1762936Z contiguous: bool, 2025-05-07T20:32:37.1763182Z compiled: bool, 2025-05-07T20:32:37.1763414Z ) -> None: 2025-05-07T20:32:37.1763631Z torch.manual_seed(2025) 2025-05-07T20:32:37.1763884Z 2025-05-07T20:32:37.1764163Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1764660Z 2025-05-07T20:32:37.1764858Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1765157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1765476Z x = x_sign * x_clamp 2025-05-07T20:32:37.1765715Z x0 = x[:, :D] 2025-05-07T20:32:37.1765936Z x1 = x[:, D:] 2025-05-07T20:32:37.1766144Z 2025-05-07T20:32:37.1766407Z if contiguous: 2025-05-07T20:32:37.1766646Z x0 = x0.contiguous() 2025-05-07T20:32:37.1766903Z x1 = x1.contiguous() 2025-05-07T20:32:37.1767144Z 2025-05-07T20:32:37.1767338Z if scale_ub is not None: 2025-05-07T20:32:37.1767605Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1767948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1768261Z ) 2025-05-07T20:32:37.1768447Z else: 2025-05-07T20:32:37.1768655Z scale_ub_tensor = None 2025-05-07T20:32:37.1768908Z 2025-05-07T20:32:37.1769140Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1769468Z op = silu_mul_quant 2025-05-07T20:32:37.1769728Z if compiled: 2025-05-07T20:32:37.1769983Z op = torch.compile(op) 2025-05-07T20:32:37.1770283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1770566Z 2025-05-07T20:32:37.1770766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1770938Z 2025-05-07T20:32:37.1771039Z moe/activation_test.py:117: 2025-05-07T20:32:37.1771343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1771696Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1772013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1772750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1773497Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1774066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1774803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1775509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1776086Z kernel = self.compile( 2025-05-07T20:32:37.1776663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1777455Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1777862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1778105Z 2025-05-07T20:32:37.1778318Z self = 2025-05-07T20:32:37.1779483Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1781003Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd5a4d1f0>} 2025-05-07T20:32:37.1782463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1783772Z context = 2025-05-07T20:32:37.1784081Z 2025-05-07T20:32:37.1784249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1784793Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1785280Z module_map=module_map) 2025-05-07T20:32:37.1785653Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1786103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1786360Z E ^ 2025-05-07T20:32:37.1786847Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1787340Z 2025-05-07T20:32:37.1787785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1788400Z 2025-05-07T20:32:37.1788507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1788931Z self=, 2025-05-07T20:32:37.1789349Z T=1, 2025-05-07T20:32:37.1789533Z D=7168, 2025-05-07T20:32:37.1789802Z scale_ub=1200.0, 2025-05-07T20:32:37.1790027Z contiguous=True, 2025-05-07T20:32:37.1790255Z compiled=True, 2025-05-07T20:32:37.1790460Z ) 2025-05-07T20:32:37.1790779Z self = 2025-05-07T20:32:37.1791289Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.1791561Z 2025-05-07T20:32:37.1791644Z @given( 2025-05-07T20:32:37.1791871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1792191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1792507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1792840Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1793177Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1793471Z ) 2025-05-07T20:32:37.1793825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1794290Z def test_silu_mul_quant( 2025-05-07T20:32:37.1794533Z self, 2025-05-07T20:32:37.1794728Z T: int, 2025-05-07T20:32:37.1794922Z D: int, 2025-05-07T20:32:37.1795158Z scale_ub: Optional[float], 2025-05-07T20:32:37.1795458Z contiguous: bool, 2025-05-07T20:32:37.1795699Z compiled: bool, 2025-05-07T20:32:37.1795920Z ) -> None: 2025-05-07T20:32:37.1796136Z torch.manual_seed(2025) 2025-05-07T20:32:37.1796373Z 2025-05-07T20:32:37.1796645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1797002Z 2025-05-07T20:32:37.1797195Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1797492Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1797810Z x = x_sign * x_clamp 2025-05-07T20:32:37.1798049Z x0 = x[:, :D] 2025-05-07T20:32:37.1798390Z x1 = x[:, D:] 2025-05-07T20:32:37.1798604Z 2025-05-07T20:32:37.1798790Z if contiguous: 2025-05-07T20:32:37.1799019Z x0 = x0.contiguous() 2025-05-07T20:32:37.1799281Z x1 = x1.contiguous() 2025-05-07T20:32:37.1799526Z 2025-05-07T20:32:37.1799715Z if scale_ub is not None: 2025-05-07T20:32:37.1799991Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1800331Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1800641Z ) 2025-05-07T20:32:37.1800829Z else: 2025-05-07T20:32:37.1801039Z scale_ub_tensor = None 2025-05-07T20:32:37.1801288Z 2025-05-07T20:32:37.1801519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1801846Z op = silu_mul_quant 2025-05-07T20:32:37.1802092Z if compiled: 2025-05-07T20:32:37.1802338Z op = torch.compile(op) 2025-05-07T20:32:37.1802648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1802925Z 2025-05-07T20:32:37.1803115Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1803280Z 2025-05-07T20:32:37.1803382Z moe/activation_test.py:117: 2025-05-07T20:32:37.1803681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1804019Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1804306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1804952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1805548Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1806258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1807070Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1807633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1808361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1809065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1809630Z kernel = self.compile( 2025-05-07T20:32:37.1810194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1810891Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1811303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1811543Z 2025-05-07T20:32:37.1811763Z self = 2025-05-07T20:32:37.1812932Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1814430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4723820>} 2025-05-07T20:32:37.1815893Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1816996Z context = 2025-05-07T20:32:37.1817300Z 2025-05-07T20:32:37.1817475Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1818015Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1818507Z module_map=module_map) 2025-05-07T20:32:37.1818886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1819323Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1819585Z E ^ 2025-05-07T20:32:37.1820076Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1820565Z 2025-05-07T20:32:37.1821015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1821571Z 2025-05-07T20:32:37.1821673Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1822097Z self=, 2025-05-07T20:32:37.1822515Z T=1, 2025-05-07T20:32:37.1822689Z D=7168, 2025-05-07T20:32:37.1822878Z scale_ub=1200.0, 2025-05-07T20:32:37.1823100Z contiguous=False, 2025-05-07T20:32:37.1823321Z compiled=True, 2025-05-07T20:32:37.1823521Z ) 2025-05-07T20:32:37.3467806Z self = 2025-05-07T20:32:37.3468622Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.3469007Z 2025-05-07T20:32:37.3469111Z @given( 2025-05-07T20:32:37.3469434Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.3469900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.3470215Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.3470550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.3471010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.3471308Z ) 2025-05-07T20:32:37.3471667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.3472127Z def test_silu_mul_quant( 2025-05-07T20:32:37.3472370Z self, 2025-05-07T20:32:37.3472622Z T: int, 2025-05-07T20:32:37.3472816Z D: int, 2025-05-07T20:32:37.3473030Z scale_ub: Optional[float], 2025-05-07T20:32:37.3473296Z contiguous: bool, 2025-05-07T20:32:37.3473538Z compiled: bool, 2025-05-07T20:32:37.3473758Z ) -> None: 2025-05-07T20:32:37.3473963Z torch.manual_seed(2025) 2025-05-07T20:32:37.3474205Z 2025-05-07T20:32:37.3474477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.3474830Z 2025-05-07T20:32:37.3475014Z x_sign = torch.sign(x) 2025-05-07T20:32:37.3475304Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.3475624Z x = x_sign * x_clamp 2025-05-07T20:32:37.3475857Z x0 = x[:, :D] 2025-05-07T20:32:37.3476074Z x1 = x[:, D:] 2025-05-07T20:32:37.3476279Z 2025-05-07T20:32:37.3476456Z if contiguous: 2025-05-07T20:32:37.3476687Z x0 = x0.contiguous() 2025-05-07T20:32:37.3476945Z x1 = x1.contiguous() 2025-05-07T20:32:37.3477188Z 2025-05-07T20:32:37.3477375Z if scale_ub is not None: 2025-05-07T20:32:37.3477646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.3477987Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.3478302Z ) 2025-05-07T20:32:37.3478496Z else: 2025-05-07T20:32:37.3478697Z scale_ub_tensor = None 2025-05-07T20:32:37.3478951Z 2025-05-07T20:32:37.3479180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.3479500Z op = silu_mul_quant 2025-05-07T20:32:37.3479744Z if compiled: 2025-05-07T20:32:37.3479991Z op = torch.compile(op) 2025-05-07T20:32:37.3480292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.3480566Z 2025-05-07T20:32:37.3480753Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.3480918Z 2025-05-07T20:32:37.3481023Z moe/activation_test.py:117: 2025-05-07T20:32:37.3481316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.3481658Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.3481942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.3482663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.3483425Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.3484137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.3484877Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.3485470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.3492019Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.3492741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.3493326Z kernel = self.compile( 2025-05-07T20:32:37.3493900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.3494606Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.3495025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.3495287Z 2025-05-07T20:32:37.3495535Z self = 2025-05-07T20:32:37.3496696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.3498311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd46a64c0>} 2025-05-07T20:32:37.3499835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.3500948Z context = 2025-05-07T20:32:37.3501248Z 2025-05-07T20:32:37.3501423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.3501964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.3502455Z module_map=module_map) 2025-05-07T20:32:37.3502839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.3503195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.3503465Z E ^ 2025-05-07T20:32:37.3503958Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.3504448Z 2025-05-07T20:32:37.3504901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.3505457Z 2025-05-07T20:32:37.3505563Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.3505989Z self=, 2025-05-07T20:32:37.3506409Z T=1, 2025-05-07T20:32:37.3506585Z D=7168, 2025-05-07T20:32:37.3506775Z scale_ub=None, 2025-05-07T20:32:37.3506995Z contiguous=False, 2025-05-07T20:32:37.3507213Z compiled=True, 2025-05-07T20:32:37.3507418Z ) 2025-05-07T20:32:37.4646400Z self = 2025-05-07T20:32:37.4647195Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.4647587Z 2025-05-07T20:32:37.4647694Z @given( 2025-05-07T20:32:37.4648005Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4648430Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4648742Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4649082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4649578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4649868Z ) 2025-05-07T20:32:37.4650232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4650704Z def test_silu_mul_quant( 2025-05-07T20:32:37.4650948Z self, 2025-05-07T20:32:37.4651147Z T: int, 2025-05-07T20:32:37.4651344Z D: int, 2025-05-07T20:32:37.4651558Z scale_ub: Optional[float], 2025-05-07T20:32:37.4651838Z contiguous: bool, 2025-05-07T20:32:37.4652083Z compiled: bool, 2025-05-07T20:32:37.4652307Z ) -> None: 2025-05-07T20:32:37.4652527Z torch.manual_seed(2025) 2025-05-07T20:32:37.4652777Z 2025-05-07T20:32:37.4653047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4653414Z 2025-05-07T20:32:37.4653616Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4653914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4654227Z x = x_sign * x_clamp 2025-05-07T20:32:37.4654481Z x0 = x[:, :D] 2025-05-07T20:32:37.4654698Z x1 = x[:, D:] 2025-05-07T20:32:37.4654905Z 2025-05-07T20:32:37.4655092Z if contiguous: 2025-05-07T20:32:37.4655326Z x0 = x0.contiguous() 2025-05-07T20:32:37.4655587Z x1 = x1.contiguous() 2025-05-07T20:32:37.4655833Z 2025-05-07T20:32:37.4656027Z if scale_ub is not None: 2025-05-07T20:32:37.4656373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4658250Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4658567Z ) 2025-05-07T20:32:37.4658753Z else: 2025-05-07T20:32:37.4658960Z scale_ub_tensor = None 2025-05-07T20:32:37.4659220Z 2025-05-07T20:32:37.4659444Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4660443Z op = silu_mul_quant 2025-05-07T20:32:37.4660702Z if compiled: 2025-05-07T20:32:37.4660955Z op = torch.compile(op) 2025-05-07T20:32:37.4661260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4661554Z 2025-05-07T20:32:37.4661750Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.4662035Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.4662340Z 2025-05-07T20:32:37.4662577Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4662920Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.4663233Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.4663559Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.4663927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4664255Z 2025-05-07T20:32:37.4664458Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:37.4664662Z 2025-05-07T20:32:37.4664768Z moe/activation_test.py:126: 2025-05-07T20:32:37.4665088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4665439Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.4665774Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4666620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.4667424Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.4667999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4668736Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4669471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.4670385Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4671279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:37.4672084Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4672860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.4673539Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.4674171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.4674725Z fn() 2025-05-07T20:32:37.4675250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.4675873Z self.fn.run( 2025-05-07T20:32:37.4676363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4676928Z kernel = self.compile( 2025-05-07T20:32:37.4677507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4678208Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4678627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4678871Z 2025-05-07T20:32:37.4679087Z self = 2025-05-07T20:32:37.4680263Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4681848Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3dd4636040>} 2025-05-07T20:32:37.4683627Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4684739Z context = 2025-05-07T20:32:37.4685039Z 2025-05-07T20:32:37.4685208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4685752Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4686245Z module_map=module_map) 2025-05-07T20:32:37.4686612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4686978Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.4687245Z E ^ 2025-05-07T20:32:37.4687731Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4688220Z 2025-05-07T20:32:37.4688670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4689226Z 2025-05-07T20:32:37.4689327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4689754Z self=, 2025-05-07T20:32:37.4690160Z T=1, 2025-05-07T20:32:37.4690336Z D=5120, 2025-05-07T20:32:37.4690527Z scale_ub=1200.0, 2025-05-07T20:32:37.4690739Z contiguous=False, 2025-05-07T20:32:37.4690964Z compiled=True, 2025-05-07T20:32:37.4691163Z ) 2025-05-07T20:32:37.8485189Z self = 2025-05-07T20:32:37.8485939Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.8486230Z 2025-05-07T20:32:37.8486317Z @given( 2025-05-07T20:32:37.8486564Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.8486887Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.8487207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.8487796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.8488141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.8488443Z ) 2025-05-07T20:32:37.8488815Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.8489291Z def test_silu_mul_quant( 2025-05-07T20:32:37.8489538Z self, 2025-05-07T20:32:37.8489745Z T: int, 2025-05-07T20:32:37.8489960Z D: int, 2025-05-07T20:32:37.8490182Z scale_ub: Optional[float], 2025-05-07T20:32:37.8490464Z contiguous: bool, 2025-05-07T20:32:37.8490719Z compiled: bool, 2025-05-07T20:32:37.8490947Z ) -> None: 2025-05-07T20:32:37.8491167Z torch.manual_seed(2025) 2025-05-07T20:32:37.8491416Z 2025-05-07T20:32:37.8491694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.8492053Z 2025-05-07T20:32:37.8492249Z x_sign = torch.sign(x) 2025-05-07T20:32:37.8492551Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.8492882Z x = x_sign * x_clamp 2025-05-07T20:32:37.8493130Z x0 = x[:, :D] 2025-05-07T20:32:37.8493355Z x1 = x[:, D:] 2025-05-07T20:32:37.8493570Z 2025-05-07T20:32:37.8493769Z if contiguous: 2025-05-07T20:32:37.8494004Z x0 = x0.contiguous() 2025-05-07T20:32:37.8494275Z x1 = x1.contiguous() 2025-05-07T20:32:37.8494618Z 2025-05-07T20:32:37.8494811Z if scale_ub is not None: 2025-05-07T20:32:37.8495095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.8495444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.8495824Z ) 2025-05-07T20:32:37.8496020Z else: 2025-05-07T20:32:37.8496284Z scale_ub_tensor = None 2025-05-07T20:32:37.8496706Z 2025-05-07T20:32:37.8496945Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.8497273Z op = silu_mul_quant 2025-05-07T20:32:37.8497536Z if compiled: 
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3dd4636f70>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3dd46d1700>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
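Every sampled example dies in the same place: Triton's front end rejects the fp8e4nv (e4m3) element type while building IR for _fbgemm_silu_mul_quant, before the kernel ever launches, and the supported set it reports, ('fp8e4b15', 'fp8e5'), is characteristic of pre-Ada NVIDIA GPUs. A minimal guard along the following lines would let such runners skip the test instead of erroring; the sm_89 threshold is an assumption (some Triton releases gate fp8e4nv at sm_90), and the decorator usage is illustrative, not the test file's actual code:

import unittest

import torch


def fp8e4nv_supported() -> bool:
    """Best-effort check that the current GPU can lower Triton's fp8e4nv."""
    if not torch.cuda.is_available():
        return False
    # Assumption: e4m3 ("fp8e4nv") needs compute capability sm_89 (Ada) or
    # newer; older parts expose only fp8e4b15/fp8e5, matching the error above.
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test above:
# @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv requires sm_89+")
# def test_silu_mul_quant(self, ...) -> None: ...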
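For context, silu_mul_quant fuses a SiLU-gated multiply with quantization to fp8, returning the quantized tensor together with its scale, which is what the test's y_fp8, y_scale unpacking expects. A rough eager-mode sketch of those semantics, assuming row-wise dynamic scaling and an e4m3 maximum of 448 (silu_mul_quant_ref and its scaling scheme are illustrative guesses, not FBGEMM's actual kernel):

from typing import Optional, Tuple

import torch
import torch.nn.functional as F

FP8_E4M3_MAX = 448.0  # assumed max magnitude of torch.float8_e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1, computed in fp32 for accuracy
    y = F.silu(x0.float()) * x1.float()
    # Assumed per-row dynamic scale, optionally clamped by scale_ub.
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.float())
    scale = amax / FP8_E4M3_MAX
    y_fp8 = (y / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale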
Hypothesis went on to try further examples; each ran the same test body and failed with the identical CompilationError from _fbgemm_silu_mul_quant:

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same traceback through triton/compiler/compiler.py:273 (make_ir), ending:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2470386Z 2025-05-07T20:32:39.2470831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2471389Z 2025-05-07T20:32:39.2471491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2471915Z self=, 2025-05-07T20:32:39.2472338Z T=1, 2025-05-07T20:32:39.2472520Z D=7168, 2025-05-07T20:32:39.2472718Z scale_ub=None, 2025-05-07T20:32:39.2472939Z contiguous=False, 2025-05-07T20:32:39.2473163Z compiled=False, 2025-05-07T20:32:39.2473375Z ) 2025-05-07T20:32:39.2473696Z self = 2025-05-07T20:32:39.2474206Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:39.2474485Z 2025-05-07T20:32:39.2474566Z @given( 2025-05-07T20:32:39.2474798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2475126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2475438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2475785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2476175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2476470Z ) 2025-05-07T20:32:39.2476833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2477414Z def test_silu_mul_quant( 2025-05-07T20:32:39.2477653Z self, 2025-05-07T20:32:39.2477848Z T: int, 2025-05-07T20:32:39.2478044Z D: int, 2025-05-07T20:32:39.2478256Z scale_ub: Optional[float], 2025-05-07T20:32:39.2478526Z contiguous: bool, 2025-05-07T20:32:39.2478766Z compiled: bool, 2025-05-07T20:32:39.2479007Z ) -> None: 2025-05-07T20:32:39.2479216Z torch.manual_seed(2025) 2025-05-07T20:32:39.2479462Z 2025-05-07T20:32:39.2479735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2480081Z 2025-05-07T20:32:39.2480267Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2480556Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2480870Z x = x_sign * x_clamp 2025-05-07T20:32:39.2481108Z x0 = x[:, :D] 2025-05-07T20:32:39.2481324Z x1 = x[:, D:] 2025-05-07T20:32:39.2481525Z 2025-05-07T20:32:39.2481702Z if contiguous: 2025-05-07T20:32:39.2481937Z x0 = x0.contiguous() 2025-05-07T20:32:39.2482194Z x1 = x1.contiguous() 2025-05-07T20:32:39.2482432Z 2025-05-07T20:32:39.2482624Z if scale_ub is not None: 2025-05-07T20:32:39.2483053Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2483493Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2483809Z ) 2025-05-07T20:32:39.2483997Z else: 2025-05-07T20:32:39.2484283Z scale_ub_tensor = None 2025-05-07T20:32:39.2484555Z 2025-05-07T20:32:39.2484793Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2485135Z op = silu_mul_quant 2025-05-07T20:32:39.2485401Z if compiled: 2025-05-07T20:32:39.2485661Z op = torch.compile(op) 2025-05-07T20:32:39.2486048Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2486344Z 2025-05-07T20:32:39.2486539Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2486718Z 2025-05-07T20:32:39.2486827Z moe/activation_test.py:117: 2025-05-07T20:32:39.2487147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2487521Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2487825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2488650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.2489493Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2490126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2490950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2491741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2492377Z kernel = self.compile( 2025-05-07T20:32:39.2493017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2493799Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2494261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2494533Z 2025-05-07T20:32:39.2494766Z self = 2025-05-07T20:32:39.2495971Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2497471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4151700>} 2025-05-07T20:32:39.2499058Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2500162Z context = 2025-05-07T20:32:39.2500467Z 2025-05-07T20:32:39.2500646Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2501185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2501677Z module_map=module_map) 2025-05-07T20:32:39.2502050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2502403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2502664Z E ^ 2025-05-07T20:32:39.2503152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2503638Z 2025-05-07T20:32:39.2504093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2504645Z 2025-05-07T20:32:39.2504746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2505172Z self=, 2025-05-07T20:32:39.2505590Z T=2048, 2025-05-07T20:32:39.2505794Z D=7168, 2025-05-07T20:32:39.2506003Z scale_ub=None, 2025-05-07T20:32:39.2506215Z contiguous=False, 2025-05-07T20:32:39.2506481Z compiled=True, 2025-05-07T20:32:39.2506682Z ) 2025-05-07T20:32:39.3683553Z self = 2025-05-07T20:32:39.3684147Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.3684563Z 2025-05-07T20:32:39.3684676Z @given( 2025-05-07T20:32:39.3685149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.3685570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.3685909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.3686268Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.3686602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.3686890Z ) 2025-05-07T20:32:39.3687244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.3687706Z def test_silu_mul_quant( 2025-05-07T20:32:39.3687953Z self, 2025-05-07T20:32:39.3688138Z T: int, 2025-05-07T20:32:39.3688336Z D: int, 2025-05-07T20:32:39.3688549Z scale_ub: Optional[float], 2025-05-07T20:32:39.3688819Z contiguous: bool, 2025-05-07T20:32:39.3689052Z compiled: bool, 2025-05-07T20:32:39.3689272Z ) -> None: 2025-05-07T20:32:39.3689482Z torch.manual_seed(2025) 2025-05-07T20:32:39.3689752Z 2025-05-07T20:32:39.3690017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.3690371Z 2025-05-07T20:32:39.3690557Z x_sign = torch.sign(x) 2025-05-07T20:32:39.3690852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.3691160Z x = x_sign * x_clamp 2025-05-07T20:32:39.3691397Z x0 = x[:, :D] 2025-05-07T20:32:39.3691609Z x1 = x[:, D:] 2025-05-07T20:32:39.3691809Z 2025-05-07T20:32:39.3691990Z if contiguous: 2025-05-07T20:32:39.3692216Z x0 = x0.contiguous() 2025-05-07T20:32:39.3692472Z x1 = x1.contiguous() 2025-05-07T20:32:39.3692711Z 2025-05-07T20:32:39.3692899Z if scale_ub is not None: 2025-05-07T20:32:39.3693169Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.3693509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.3693828Z ) 2025-05-07T20:32:39.3694006Z else: 2025-05-07T20:32:39.3694210Z scale_ub_tensor = None 2025-05-07T20:32:39.3694460Z 2025-05-07T20:32:39.3694683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.3695002Z op = silu_mul_quant 2025-05-07T20:32:39.3695954Z if compiled: 2025-05-07T20:32:39.3696210Z op = torch.compile(op) 2025-05-07T20:32:39.3696507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3696790Z 2025-05-07T20:32:39.3696975Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.3697144Z 2025-05-07T20:32:39.3697244Z moe/activation_test.py:117: 2025-05-07T20:32:39.3697545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3697887Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.3698165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3698754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.3699353Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.3700056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.3700799Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.3701360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.3702091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.3702791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.3703423Z kernel = self.compile( 2025-05-07T20:32:39.3703991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.3704686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.3705091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3705382Z 2025-05-07T20:32:39.3705595Z self = 2025-05-07T20:32:39.3706821Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.3708373Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd41de3a0>} 2025-05-07T20:32:39.3709952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.3711053Z context = 2025-05-07T20:32:39.3711359Z 2025-05-07T20:32:39.3711529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.3712076Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.3712563Z module_map=module_map) 2025-05-07T20:32:39.3712929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.3713286Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.3713542Z E ^ 2025-05-07T20:32:39.3714023Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.3714515Z 2025-05-07T20:32:39.3714965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.3715524Z 2025-05-07T20:32:39.3715624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.3716045Z self=, 2025-05-07T20:32:39.3716458Z T=4096, 2025-05-07T20:32:39.3716641Z D=7168, 2025-05-07T20:32:39.3716828Z scale_ub=None, 2025-05-07T20:32:39.3717036Z contiguous=False, 2025-05-07T20:32:39.3717256Z compiled=True, 2025-05-07T20:32:39.3717542Z ) 2025-05-07T20:32:39.3717861Z self = 2025-05-07T20:32:39.3718376Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.3718667Z 2025-05-07T20:32:39.3718740Z @given( 2025-05-07T20:32:39.3718966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.3719282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.3719600Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.3719939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.3720268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.3720560Z ) 2025-05-07T20:32:39.3720918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.3721377Z def test_silu_mul_quant( 2025-05-07T20:32:39.3721612Z self, 2025-05-07T20:32:39.3721800Z T: int, 2025-05-07T20:32:39.3721989Z D: int, 2025-05-07T20:32:39.3722210Z scale_ub: Optional[float], 2025-05-07T20:32:39.3722481Z contiguous: bool, 2025-05-07T20:32:39.3722717Z compiled: bool, 2025-05-07T20:32:39.3722930Z ) -> None: 2025-05-07T20:32:39.3723142Z torch.manual_seed(2025) 2025-05-07T20:32:39.3723389Z 2025-05-07T20:32:39.3723658Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.3724057Z 2025-05-07T20:32:39.3724245Z x_sign = torch.sign(x) 2025-05-07T20:32:39.3724527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.3724838Z x = x_sign * x_clamp 2025-05-07T20:32:39.3725075Z x0 = x[:, :D] 2025-05-07T20:32:39.3725285Z x1 = x[:, D:] 2025-05-07T20:32:39.3725680Z 2025-05-07T20:32:39.3725912Z if contiguous: 2025-05-07T20:32:39.3726135Z x0 = x0.contiguous() 2025-05-07T20:32:39.3726391Z x1 = x1.contiguous() 2025-05-07T20:32:39.3726632Z 2025-05-07T20:32:39.3726823Z if scale_ub is not None: 2025-05-07T20:32:39.3727094Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.3727433Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.3727748Z ) 2025-05-07T20:32:39.3727932Z else: 2025-05-07T20:32:39.3728135Z scale_ub_tensor = None 2025-05-07T20:32:39.3728388Z 2025-05-07T20:32:39.3728610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.3728932Z op = silu_mul_quant 2025-05-07T20:32:39.3729178Z if compiled: 2025-05-07T20:32:39.3729420Z op = torch.compile(op) 2025-05-07T20:32:39.3729717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3729995Z 2025-05-07T20:32:39.3730176Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.3730348Z 2025-05-07T20:32:39.3730444Z moe/activation_test.py:117: 2025-05-07T20:32:39.3730742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3731085Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.3731366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.3731950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.3732543Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.3733241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.3733982Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.3734547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.3735268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.3736024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.3736588Z kernel = self.compile( 2025-05-07T20:32:39.3737243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.3737937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.3738344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.3738586Z 2025-05-07T20:32:39.3738805Z self = 2025-05-07T20:32:39.3739969Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.3741465Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd41de700>} 2025-05-07T20:32:39.3742933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.3744040Z context = 2025-05-07T20:32:39.3744344Z 2025-05-07T20:32:39.3744516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.3745058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.3745605Z module_map=module_map) 2025-05-07T20:32:39.3745978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.3746337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.3746593Z E ^ 2025-05-07T20:32:39.3747120Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.3747603Z 2025-05-07T20:32:39.3748065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.3748619Z 2025-05-07T20:32:39.7652173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7652685Z self=, 2025-05-07T20:32:39.7653184Z T=16384, 2025-05-07T20:32:39.7660093Z D=5120, 2025-05-07T20:32:39.7660310Z scale_ub=1200.0, 2025-05-07T20:32:39.7660587Z contiguous=False, 2025-05-07T20:32:39.7660822Z compiled=False, 2025-05-07T20:32:39.7661023Z ) 2025-05-07T20:32:39.7661355Z self = 2025-05-07T20:32:39.7661889Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.7662194Z 2025-05-07T20:32:39.7662269Z @given( 2025-05-07T20:32:39.7662498Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7662819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7663141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7663473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7663809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7664105Z ) 2025-05-07T20:32:39.7664456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7664918Z def test_silu_mul_quant( 2025-05-07T20:32:39.7665160Z self, 2025-05-07T20:32:39.7665342Z T: int, 2025-05-07T20:32:39.7665536Z D: int, 2025-05-07T20:32:39.7665750Z scale_ub: Optional[float], 2025-05-07T20:32:39.7666014Z contiguous: bool, 2025-05-07T20:32:39.7666261Z compiled: bool, 2025-05-07T20:32:39.7666485Z ) -> None: 2025-05-07T20:32:39.7666698Z torch.manual_seed(2025) 2025-05-07T20:32:39.7666942Z 2025-05-07T20:32:39.7667217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7667571Z 2025-05-07T20:32:39.7667928Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7668224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7668537Z x = x_sign * x_clamp 2025-05-07T20:32:39.7668773Z x0 = x[:, :D] 2025-05-07T20:32:39.7668987Z x1 = x[:, D:] 2025-05-07T20:32:39.7669203Z 2025-05-07T20:32:39.7669380Z if contiguous: 2025-05-07T20:32:39.7669608Z x0 = x0.contiguous() 2025-05-07T20:32:39.7669969Z x1 = x1.contiguous() 2025-05-07T20:32:39.7670209Z 2025-05-07T20:32:39.7670394Z if scale_ub is not None: 2025-05-07T20:32:39.7670659Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7670996Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7671305Z ) 2025-05-07T20:32:39.7671486Z else: 2025-05-07T20:32:39.7671692Z scale_ub_tensor = None 2025-05-07T20:32:39.7671946Z 2025-05-07T20:32:39.7672166Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7672494Z op = silu_mul_quant 2025-05-07T20:32:39.7672745Z if compiled: 2025-05-07T20:32:39.7672987Z op = torch.compile(op) 2025-05-07T20:32:39.7673283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7673562Z 2025-05-07T20:32:39.7673745Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7673907Z 2025-05-07T20:32:39.7674002Z moe/activation_test.py:117: 2025-05-07T20:32:39.7674378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7674720Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7674998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7675738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.7676541Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7677108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7677830Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7678532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7679091Z kernel = self.compile( 2025-05-07T20:32:39.7679657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7680355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7680765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7681004Z 2025-05-07T20:32:39.7681219Z self = 2025-05-07T20:32:39.7682383Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7684080Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3e83790>} 2025-05-07T20:32:39.7685537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7686642Z context = 2025-05-07T20:32:39.7686943Z 2025-05-07T20:32:39.7687111Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7687651Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7688138Z module_map=module_map) 2025-05-07T20:32:39.7688514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7688988Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7689256Z E ^ 2025-05-07T20:32:39.7689747Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7690230Z 2025-05-07T20:32:39.7690678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7691233Z 2025-05-07T20:32:39.7691334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7691754Z self=, 2025-05-07T20:32:39.7692171Z T=16384, 2025-05-07T20:32:39.7692351Z D=5120, 2025-05-07T20:32:39.7692537Z scale_ub=1200.0, 2025-05-07T20:32:39.7692752Z contiguous=True, 2025-05-07T20:32:39.7692960Z compiled=True, 2025-05-07T20:32:39.7693156Z ) 2025-05-07T20:32:39.7693475Z self = 2025-05-07T20:32:39.7694001Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:39.7694292Z 2025-05-07T20:32:39.7694370Z @given( 2025-05-07T20:32:39.7694591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7694906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7695211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7695542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7695946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7696231Z ) 2025-05-07T20:32:39.7696584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7697038Z def test_silu_mul_quant( 2025-05-07T20:32:39.7697273Z self, 2025-05-07T20:32:39.7697529Z T: int, 2025-05-07T20:32:39.7697724Z D: int, 2025-05-07T20:32:39.7697938Z scale_ub: Optional[float], 2025-05-07T20:32:39.7698205Z contiguous: bool, 2025-05-07T20:32:39.7698446Z compiled: bool, 2025-05-07T20:32:39.7698664Z ) -> None: 2025-05-07T20:32:39.7698871Z torch.manual_seed(2025) 2025-05-07T20:32:39.7699111Z 2025-05-07T20:32:39.7699377Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7699724Z 2025-05-07T20:32:39.7699909Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7700196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7700511Z x = x_sign * x_clamp 2025-05-07T20:32:39.7700750Z x0 = x[:, :D] 2025-05-07T20:32:39.7700963Z x1 = x[:, D:] 2025-05-07T20:32:39.7701163Z 2025-05-07T20:32:39.7701345Z if contiguous: 2025-05-07T20:32:39.7701575Z x0 = x0.contiguous() 2025-05-07T20:32:39.7701832Z x1 = x1.contiguous() 2025-05-07T20:32:39.7702067Z 2025-05-07T20:32:39.7702251Z if scale_ub is not None: 2025-05-07T20:32:39.7702517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7702858Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7703168Z ) 2025-05-07T20:32:39.7703353Z else: 2025-05-07T20:32:39.7703551Z scale_ub_tensor = None 2025-05-07T20:32:39.7703807Z 2025-05-07T20:32:39.7704033Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7704346Z op = silu_mul_quant 2025-05-07T20:32:39.7704593Z if compiled: 2025-05-07T20:32:39.7704839Z op = torch.compile(op) 2025-05-07T20:32:39.7705134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7705413Z 2025-05-07T20:32:39.7705602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7705775Z 2025-05-07T20:32:39.7705884Z moe/activation_test.py:117: 2025-05-07T20:32:39.7706207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7706550Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7706829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7707496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.7708092Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.7708790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.7709522Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7710160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7710886Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7711584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7712144Z kernel = self.compile( 2025-05-07T20:32:39.7712708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7713404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7713805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7714049Z 2025-05-07T20:32:39.7714258Z self = 2025-05-07T20:32:39.7715416Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7716963Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d91550>} 2025-05-07T20:32:39.7718500Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7719599Z context = 2025-05-07T20:32:39.7719911Z 2025-05-07T20:32:39.7720078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7720622Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7721105Z module_map=module_map) 2025-05-07T20:32:39.7721472Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7721830Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7722086Z E ^ 2025-05-07T20:32:39.7722564Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7723053Z 2025-05-07T20:32:39.7723495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7724050Z 2025-05-07T20:32:39.9951083Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9951516Z self=, 2025-05-07T20:32:39.9951977Z T=16384, 2025-05-07T20:32:39.9952193Z D=5120, 2025-05-07T20:32:39.9952461Z scale_ub=None, 2025-05-07T20:32:39.9952722Z contiguous=False, 2025-05-07T20:32:39.9952953Z compiled=True, 2025-05-07T20:32:39.9953159Z ) 2025-05-07T20:32:39.9953496Z self = 2025-05-07T20:32:39.9954020Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.9954324Z 2025-05-07T20:32:39.9954400Z @given( 2025-05-07T20:32:39.9954634Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.9954960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.9955281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.9955630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.9956133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.9956433Z ) 2025-05-07T20:32:39.9956791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.9957254Z def test_silu_mul_quant( 2025-05-07T20:32:39.9957489Z self, 2025-05-07T20:32:39.9957686Z T: int, 2025-05-07T20:32:39.9957875Z D: int, 2025-05-07T20:32:39.9958096Z scale_ub: Optional[float], 2025-05-07T20:32:39.9958367Z contiguous: bool, 2025-05-07T20:32:39.9958599Z compiled: bool, 2025-05-07T20:32:39.9958822Z ) -> None: 2025-05-07T20:32:39.9959035Z torch.manual_seed(2025) 2025-05-07T20:32:39.9959270Z 2025-05-07T20:32:39.9959536Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.9959889Z 2025-05-07T20:32:39.9960071Z x_sign = torch.sign(x) 2025-05-07T20:32:39.9960363Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.9960681Z x = x_sign * x_clamp 2025-05-07T20:32:39.9960917Z x0 = x[:, :D] 2025-05-07T20:32:39.9961132Z x1 = x[:, D:] 2025-05-07T20:32:39.9961333Z 2025-05-07T20:32:39.9961512Z if contiguous: 2025-05-07T20:32:39.9961750Z x0 = x0.contiguous() 2025-05-07T20:32:39.9962008Z x1 = x1.contiguous() 2025-05-07T20:32:39.9962240Z 2025-05-07T20:32:39.9962426Z if scale_ub is not None: 2025-05-07T20:32:39.9962766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.9963103Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.9963412Z ) 2025-05-07T20:32:39.9963592Z else: 2025-05-07T20:32:39.9963801Z scale_ub_tensor = None 2025-05-07T20:32:39.9964045Z 2025-05-07T20:32:39.9964338Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.9964660Z op = silu_mul_quant 2025-05-07T20:32:39.9964905Z if compiled: 2025-05-07T20:32:39.9965157Z op = torch.compile(op) 2025-05-07T20:32:39.9965452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9965723Z 2025-05-07T20:32:39.9965911Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.9966076Z 2025-05-07T20:32:39.9966177Z moe/activation_test.py:117: 2025-05-07T20:32:39.9966469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9966814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.9967095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9967679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.9968266Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.9968969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.9969711Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.9970272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.9970999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.9971706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.9972269Z kernel = self.compile( 2025-05-07T20:32:39.9972833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.9973541Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.9973951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9974190Z 2025-05-07T20:32:39.9974411Z self = 2025-05-07T20:32:39.9975655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.9977161Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d910d0>} 2025-05-07T20:32:39.9978635Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.9979745Z context = 2025-05-07T20:32:39.9980048Z 2025-05-07T20:32:39.9980221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.9980765Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.9981250Z module_map=module_map) 2025-05-07T20:32:39.9981631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9981984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9982252Z E ^ 2025-05-07T20:32:39.9982922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9983415Z 2025-05-07T20:32:39.9983865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9984492Z 2025-05-07T20:32:39.9984594Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9985021Z self=, 2025-05-07T20:32:39.9985438Z T=2048, 2025-05-07T20:32:39.9985619Z D=5120, 2025-05-07T20:32:39.9985892Z scale_ub=None, 2025-05-07T20:32:39.9986126Z contiguous=False, 2025-05-07T20:32:39.9986346Z compiled=True, 2025-05-07T20:32:39.9986546Z ) 2025-05-07T20:32:40.1197903Z self = 2025-05-07T20:32:40.1198430Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.1198740Z 2025-05-07T20:32:40.1198825Z @given( 2025-05-07T20:32:40.1199140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1199529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1199844Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1200178Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1200513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1200804Z ) 2025-05-07T20:32:40.1201166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1201622Z def test_silu_mul_quant( 2025-05-07T20:32:40.1201865Z self, 2025-05-07T20:32:40.1202051Z T: int, 2025-05-07T20:32:40.1202237Z D: int, 2025-05-07T20:32:40.1202448Z scale_ub: Optional[float], 2025-05-07T20:32:40.1202722Z contiguous: bool, 2025-05-07T20:32:40.1202954Z compiled: bool, 2025-05-07T20:32:40.1203173Z ) -> None: 2025-05-07T20:32:40.1203383Z torch.manual_seed(2025) 2025-05-07T20:32:40.1203615Z 2025-05-07T20:32:40.1203887Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1204239Z 2025-05-07T20:32:40.1204422Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1204716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1205030Z x = x_sign * x_clamp 2025-05-07T20:32:40.1205266Z x0 = x[:, :D] 2025-05-07T20:32:40.1205482Z x1 = x[:, D:] 2025-05-07T20:32:40.1205688Z 2025-05-07T20:32:40.1205884Z if contiguous: 2025-05-07T20:32:40.1206138Z x0 = x0.contiguous() 2025-05-07T20:32:40.1206399Z x1 = x1.contiguous() 2025-05-07T20:32:40.1206648Z 2025-05-07T20:32:40.1206831Z if scale_ub is not None: 2025-05-07T20:32:40.1207268Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1207618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1207927Z ) 2025-05-07T20:32:40.1208112Z else: 2025-05-07T20:32:40.1208315Z scale_ub_tensor = None 2025-05-07T20:32:40.1208561Z 2025-05-07T20:32:40.1208788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1209115Z op = silu_mul_quant 2025-05-07T20:32:40.1209362Z if compiled: 2025-05-07T20:32:40.1209606Z op = torch.compile(op) 2025-05-07T20:32:40.1209907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1210179Z 2025-05-07T20:32:40.1210364Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1210528Z 2025-05-07T20:32:40.1210635Z moe/activation_test.py:117: 2025-05-07T20:32:40.1210931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1211268Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1211558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1212144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.1212731Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.1213430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.1214234Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1214795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1215515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1216218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1216846Z kernel = self.compile( 2025-05-07T20:32:40.1217411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1218100Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1218510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1218748Z 2025-05-07T20:32:40.1218960Z self = 2025-05-07T20:32:40.1220115Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1221616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d14af0>} 2025-05-07T20:32:40.1223090Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1224192Z context = 2025-05-07T20:32:40.1224499Z 2025-05-07T20:32:40.1224672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1225209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1225696Z module_map=module_map) 2025-05-07T20:32:40.1226067Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1226418Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1226675Z E ^ 2025-05-07T20:32:40.1227161Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1227648Z 2025-05-07T20:32:40.1228098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.1228734Z 2025-05-07T20:32:40.1228835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.1229263Z self=, 2025-05-07T20:32:40.1229678Z T=2048, 2025-05-07T20:32:40.1229950Z D=5120, 2025-05-07T20:32:40.1230135Z scale_ub=1200.0, 2025-05-07T20:32:40.1230354Z contiguous=False, 2025-05-07T20:32:40.1230570Z compiled=True, 2025-05-07T20:32:40.1230793Z ) 2025-05-07T20:32:40.1231113Z self = 2025-05-07T20:32:40.1231630Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.1231916Z 2025-05-07T20:32:40.1231992Z @given( 2025-05-07T20:32:40.1232214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1232532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1232845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1233190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1233531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1233827Z ) 2025-05-07T20:32:40.1234183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1234646Z def test_silu_mul_quant( 2025-05-07T20:32:40.1234898Z self, 2025-05-07T20:32:40.1235088Z T: int, 2025-05-07T20:32:40.1235354Z D: int, 2025-05-07T20:32:40.1235565Z scale_ub: Optional[float], 2025-05-07T20:32:40.1235834Z contiguous: bool, 2025-05-07T20:32:40.1236073Z compiled: bool, 2025-05-07T20:32:40.1236292Z ) -> None: 2025-05-07T20:32:40.1236498Z torch.manual_seed(2025) 2025-05-07T20:32:40.1236735Z 2025-05-07T20:32:40.1237049Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1237398Z 2025-05-07T20:32:40.1237580Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1237873Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1238189Z x = x_sign * x_clamp 2025-05-07T20:32:40.1238423Z x0 = x[:, :D] 2025-05-07T20:32:40.1238642Z x1 = x[:, D:] 2025-05-07T20:32:40.1238842Z 2025-05-07T20:32:40.1239016Z if contiguous: 2025-05-07T20:32:40.1239242Z x0 = x0.contiguous() 2025-05-07T20:32:40.1239494Z x1 = x1.contiguous() 2025-05-07T20:32:40.1239728Z 2025-05-07T20:32:40.1239915Z if scale_ub is not None: 2025-05-07T20:32:40.1240186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1240518Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1240839Z ) 2025-05-07T20:32:40.1241031Z else: 2025-05-07T20:32:40.1241239Z scale_ub_tensor = None 2025-05-07T20:32:40.1241489Z 2025-05-07T20:32:40.1241714Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1247864Z op = silu_mul_quant 2025-05-07T20:32:40.1248136Z if compiled: 2025-05-07T20:32:40.1248395Z op = torch.compile(op) 2025-05-07T20:32:40.1248699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1248981Z 2025-05-07T20:32:40.1249177Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1249345Z 2025-05-07T20:32:40.1249444Z moe/activation_test.py:117: 2025-05-07T20:32:40.1249752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1250101Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1250384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1250986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.1251585Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.1252298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.1253041Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1253751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1254489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1255193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1255765Z kernel = self.compile( 2025-05-07T20:32:40.1256344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1257045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1257456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1257704Z 2025-05-07T20:32:40.1257919Z self = 2025-05-07T20:32:40.1259091Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1260591Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3adf820>} 2025-05-07T20:32:40.1262057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1263212Z context = 2025-05-07T20:32:40.1263529Z 2025-05-07T20:32:40.1263701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1264292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1264785Z module_map=module_map) 2025-05-07T20:32:40.1265165Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1265527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1265795Z E ^ 2025-05-07T20:32:40.1266282Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1266780Z 2025-05-07T20:32:40.1267235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.1267789Z 2025-05-07T20:32:40.3504175Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3505479Z self=, 2025-05-07T20:32:40.3506246Z T=4096, 2025-05-07T20:32:40.3506465Z D=5120, 2025-05-07T20:32:40.3506651Z scale_ub=1200.0, 2025-05-07T20:32:40.3506871Z contiguous=True, 2025-05-07T20:32:40.3507096Z compiled=True, 2025-05-07T20:32:40.3507302Z ) 2025-05-07T20:32:40.3507642Z self = 2025-05-07T20:32:40.3508179Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.3508476Z 2025-05-07T20:32:40.3508556Z @given( 2025-05-07T20:32:40.3508787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3509113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3509428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3509917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3510257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3510555Z ) 2025-05-07T20:32:40.3510917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3511397Z def test_silu_mul_quant( 2025-05-07T20:32:40.3511643Z self, 2025-05-07T20:32:40.3511829Z T: int, 2025-05-07T20:32:40.3512028Z D: int, 2025-05-07T20:32:40.3512437Z scale_ub: Optional[float], 2025-05-07T20:32:40.3512719Z contiguous: bool, 2025-05-07T20:32:40.3512967Z compiled: bool, 2025-05-07T20:32:40.3513195Z ) -> None: 2025-05-07T20:32:40.3513411Z torch.manual_seed(2025) 2025-05-07T20:32:40.3513663Z 2025-05-07T20:32:40.3513945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3514309Z 2025-05-07T20:32:40.3514515Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3514812Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3515128Z x = x_sign * x_clamp 2025-05-07T20:32:40.3515377Z x0 = x[:, :D] 2025-05-07T20:32:40.3515595Z x1 = x[:, D:] 2025-05-07T20:32:40.3515811Z 2025-05-07T20:32:40.3515993Z if contiguous: 2025-05-07T20:32:40.3516235Z x0 = x0.contiguous() 2025-05-07T20:32:40.3516505Z x1 = x1.contiguous() 2025-05-07T20:32:40.3516750Z 2025-05-07T20:32:40.3516953Z if scale_ub is not None: 2025-05-07T20:32:40.3517231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3517570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3517886Z ) 2025-05-07T20:32:40.3518073Z else: 2025-05-07T20:32:40.3518271Z scale_ub_tensor = None 2025-05-07T20:32:40.3518530Z 2025-05-07T20:32:40.3518753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3519203Z op = silu_mul_quant 2025-05-07T20:32:40.3519464Z if compiled: 2025-05-07T20:32:40.3519718Z op = torch.compile(op) 2025-05-07T20:32:40.3520023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3520315Z 2025-05-07T20:32:40.3520505Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.3520738Z 2025-05-07T20:32:40.3520837Z moe/activation_test.py:117: 2025-05-07T20:32:40.3521128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3521479Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.3521765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3522355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.3522955Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.3523665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.3524418Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.3524986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3525724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3526444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3527008Z kernel = self.compile( 2025-05-07T20:32:40.3527578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3528272Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3528681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3528923Z 2025-05-07T20:32:40.3529134Z self = 2025-05-07T20:32:40.3530300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3531811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3bf2430>} 2025-05-07T20:32:40.3533363Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3534465Z context = 2025-05-07T20:32:40.3534766Z 2025-05-07T20:32:40.3534933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3535472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3535964Z module_map=module_map) 2025-05-07T20:32:40.3536332Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3536692Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3536958Z E ^ 2025-05-07T20:32:40.3537437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3537925Z 2025-05-07T20:32:40.3538377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3538933Z 2025-05-07T20:32:40.3539031Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3539455Z self=, 2025-05-07T20:32:40.3539869Z T=128, 2025-05-07T20:32:40.3540050Z D=5120, 2025-05-07T20:32:40.3540236Z scale_ub=1200.0, 2025-05-07T20:32:40.3540496Z contiguous=False, 2025-05-07T20:32:40.3540720Z compiled=True, 2025-05-07T20:32:40.3540918Z ) 2025-05-07T20:32:40.6725801Z self = 2025-05-07T20:32:40.6726422Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.6726846Z 2025-05-07T20:32:40.6727145Z @given( 2025-05-07T20:32:40.6727461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6727904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6728317Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6728737Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6729169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6729498Z ) 2025-05-07T20:32:40.6729857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6730322Z def test_silu_mul_quant( 2025-05-07T20:32:40.6730567Z self, 2025-05-07T20:32:40.6730761Z T: int, 2025-05-07T20:32:40.6730962Z D: int, 2025-05-07T20:32:40.6731187Z scale_ub: Optional[float], 2025-05-07T20:32:40.6731457Z contiguous: bool, 2025-05-07T20:32:40.6731705Z compiled: bool, 2025-05-07T20:32:40.6731936Z ) -> None: 2025-05-07T20:32:40.6732147Z torch.manual_seed(2025) 2025-05-07T20:32:40.6732398Z 2025-05-07T20:32:40.6732678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6733039Z 2025-05-07T20:32:40.6733229Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6733528Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6733846Z x = x_sign * x_clamp 2025-05-07T20:32:40.6734086Z x0 = x[:, :D] 2025-05-07T20:32:40.6734303Z x1 = x[:, D:] 2025-05-07T20:32:40.6734516Z 2025-05-07T20:32:40.6734697Z if contiguous: 2025-05-07T20:32:40.6734929Z x0 = x0.contiguous() 2025-05-07T20:32:40.6735195Z x1 = x1.contiguous() 2025-05-07T20:32:40.6735438Z 2025-05-07T20:32:40.6735632Z if scale_ub is not None: 2025-05-07T20:32:40.6735910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6736289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6736619Z ) 2025-05-07T20:32:40.6736812Z else: 2025-05-07T20:32:40.6737019Z scale_ub_tensor = None 2025-05-07T20:32:40.6737278Z 2025-05-07T20:32:40.6737513Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6737968Z op = silu_mul_quant 2025-05-07T20:32:40.6738222Z if compiled: 2025-05-07T20:32:40.6738469Z op = torch.compile(op) 2025-05-07T20:32:40.6738769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6739043Z 2025-05-07T20:32:40.6739225Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6739389Z 2025-05-07T20:32:40.6739486Z moe/activation_test.py:117: 2025-05-07T20:32:40.6739783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6740123Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6740409Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6740991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6741584Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6742285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6743028Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6743585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6744313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6745012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6745635Z kernel = self.compile( 2025-05-07T20:32:40.6746197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6746884Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6747295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6747578Z 2025-05-07T20:32:40.6747787Z self = 2025-05-07T20:32:40.6748960Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6750614Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3a36040>} 2025-05-07T20:32:40.6752090Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6753200Z context = 2025-05-07T20:32:40.6753508Z 2025-05-07T20:32:40.6753678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6754225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6754717Z module_map=module_map) 2025-05-07T20:32:40.6755090Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6755453Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6755721Z E ^ 2025-05-07T20:32:40.6756212Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6756699Z 2025-05-07T20:32:40.6757144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6757707Z 2025-05-07T20:32:40.6757810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6758239Z self=, 2025-05-07T20:32:40.6758667Z T=16384, 2025-05-07T20:32:40.6758861Z D=7168, 2025-05-07T20:32:40.6759057Z scale_ub=1200.0, 2025-05-07T20:32:40.6759281Z contiguous=True, 2025-05-07T20:32:40.6759585Z compiled=True, 2025-05-07T20:32:40.6759784Z ) 2025-05-07T20:32:40.6760107Z self = 2025-05-07T20:32:40.6760617Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.6760916Z 2025-05-07T20:32:40.6760990Z @given( 2025-05-07T20:32:40.6761214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6761525Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6761832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6762163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6762494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6762778Z ) 2025-05-07T20:32:40.6763130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6763591Z def test_silu_mul_quant( 2025-05-07T20:32:40.6763827Z self, 2025-05-07T20:32:40.6764015Z T: int, 2025-05-07T20:32:40.6764212Z D: int, 2025-05-07T20:32:40.6764419Z scale_ub: Optional[float], 2025-05-07T20:32:40.6764689Z contiguous: bool, 2025-05-07T20:32:40.6764928Z compiled: bool, 2025-05-07T20:32:40.6765142Z ) -> None: 2025-05-07T20:32:40.6765350Z torch.manual_seed(2025) 2025-05-07T20:32:40.6765589Z 2025-05-07T20:32:40.6765855Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6766253Z 2025-05-07T20:32:40.6766442Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6766722Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6767033Z x = x_sign * x_clamp 2025-05-07T20:32:40.6767270Z x0 = x[:, :D] 2025-05-07T20:32:40.6767481Z x1 = x[:, D:] 2025-05-07T20:32:40.6767724Z 2025-05-07T20:32:40.6767904Z if contiguous: 2025-05-07T20:32:40.6768127Z x0 = x0.contiguous() 2025-05-07T20:32:40.6768381Z x1 = x1.contiguous() 2025-05-07T20:32:40.6768624Z 2025-05-07T20:32:40.6768815Z if scale_ub is not None: 2025-05-07T20:32:40.6769083Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6769417Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6769732Z ) 2025-05-07T20:32:40.6769911Z else: 2025-05-07T20:32:40.6770117Z scale_ub_tensor = None 2025-05-07T20:32:40.6770365Z 2025-05-07T20:32:40.6770588Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6770908Z op = silu_mul_quant 2025-05-07T20:32:40.6771156Z if compiled: 2025-05-07T20:32:40.6771401Z op = torch.compile(op) 2025-05-07T20:32:40.6771700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6771983Z 2025-05-07T20:32:40.6772167Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6772336Z 2025-05-07T20:32:40.6772429Z moe/activation_test.py:117: 2025-05-07T20:32:40.6772728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6773070Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6773346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6773931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6774522Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6775216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6775953Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6776510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6777238Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6777943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6778612Z kernel = self.compile( 2025-05-07T20:32:40.6779180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6779871Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6780272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6780515Z 2025-05-07T20:32:40.6780727Z self = 2025-05-07T20:32:40.6781889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6783572Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3a36af0>} 2025-05-07T20:32:40.6785038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6786193Z context = 2025-05-07T20:32:40.6786500Z 2025-05-07T20:32:40.6786667Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6787280Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6787762Z module_map=module_map) 2025-05-07T20:32:40.6788132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6788487Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6788806Z E ^ 2025-05-07T20:32:40.6789291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6789840Z 2025-05-07T20:32:40.6790291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6790844Z 2025-05-07T20:32:40.9551729Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9553037Z self=, 2025-05-07T20:32:40.9554201Z T=16384, 2025-05-07T20:32:40.9554630Z D=5120, 2025-05-07T20:32:40.9555028Z scale_ub=1200.0, 2025-05-07T20:32:40.9555474Z contiguous=True, 2025-05-07T20:32:40.9555914Z compiled=False, 2025-05-07T20:32:40.9556119Z ) 2025-05-07T20:32:40.9556450Z self = 2025-05-07T20:32:40.9556976Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:40.9557275Z 2025-05-07T20:32:40.9557352Z @given( 2025-05-07T20:32:40.9557587Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9557908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9558224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9558560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9558899Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9559188Z ) 2025-05-07T20:32:40.9559547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9560016Z def test_silu_mul_quant( 2025-05-07T20:32:40.9560264Z self, 2025-05-07T20:32:40.9560454Z T: int, 2025-05-07T20:32:40.9560649Z D: int, 2025-05-07T20:32:40.9560867Z scale_ub: Optional[float], 2025-05-07T20:32:40.9561137Z contiguous: bool, 2025-05-07T20:32:40.9561380Z compiled: bool, 2025-05-07T20:32:40.9561611Z ) -> None: 2025-05-07T20:32:40.9561827Z torch.manual_seed(2025) 2025-05-07T20:32:40.9562074Z 2025-05-07T20:32:40.9562348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9562701Z 2025-05-07T20:32:40.9563058Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9563360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9563673Z x = x_sign * x_clamp 2025-05-07T20:32:40.9563915Z x0 = x[:, :D] 2025-05-07T20:32:40.9564134Z x1 = x[:, D:] 2025-05-07T20:32:40.9564337Z 2025-05-07T20:32:40.9564519Z if contiguous: 2025-05-07T20:32:40.9564752Z x0 = x0.contiguous() 2025-05-07T20:32:40.9565018Z x1 = x1.contiguous() 2025-05-07T20:32:40.9565262Z 2025-05-07T20:32:40.9565451Z if scale_ub is not None: 2025-05-07T20:32:40.9565726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9566062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9566387Z ) 2025-05-07T20:32:40.9566577Z else: 2025-05-07T20:32:40.9566781Z scale_ub_tensor = None 2025-05-07T20:32:40.9567040Z 2025-05-07T20:32:40.9567280Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9567602Z op = silu_mul_quant 2025-05-07T20:32:40.9567851Z if compiled: 2025-05-07T20:32:40.9568100Z op = torch.compile(op) 2025-05-07T20:32:40.9568397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9568677Z 2025-05-07T20:32:40.9568867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9569034Z 2025-05-07T20:32:40.9569209Z moe/activation_test.py:117: 2025-05-07T20:32:40.9569508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9569860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9570154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9570890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9571693Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9572258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9572981Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9573692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9574254Z kernel = self.compile( 2025-05-07T20:32:40.9574818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9575510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9575924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9576169Z 2025-05-07T20:32:40.9576379Z self = 2025-05-07T20:32:40.9577555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9579050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3a04550>} 2025-05-07T20:32:40.9580510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9581613Z context = 2025-05-07T20:32:40.9581916Z 2025-05-07T20:32:40.9582085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9582631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9583309Z module_map=module_map) 2025-05-07T20:32:40.9583805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9584170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9584434Z E ^ 2025-05-07T20:32:40.9584919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9585406Z 2025-05-07T20:32:40.9585853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9592172Z 2025-05-07T20:32:40.9592302Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9592756Z self=, 2025-05-07T20:32:40.9593187Z T=1, 2025-05-07T20:32:40.9593374Z D=7168, 2025-05-07T20:32:40.9593574Z scale_ub=1200.0, 2025-05-07T20:32:40.9593805Z contiguous=False, 2025-05-07T20:32:40.9594039Z compiled=False, 2025-05-07T20:32:40.9594252Z ) 2025-05-07T20:32:40.9594583Z self = 2025-05-07T20:32:40.9595110Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.9595402Z 2025-05-07T20:32:40.9595481Z @given( 2025-05-07T20:32:40.9595714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9596035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9596350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9596794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9597128Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9597420Z ) 2025-05-07T20:32:40.9597782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9598249Z def test_silu_mul_quant( 2025-05-07T20:32:40.9598556Z self, 2025-05-07T20:32:40.9598747Z T: int, 2025-05-07T20:32:40.9598938Z D: int, 2025-05-07T20:32:40.9599148Z scale_ub: Optional[float], 2025-05-07T20:32:40.9599433Z contiguous: bool, 2025-05-07T20:32:40.9599674Z compiled: bool, 2025-05-07T20:32:40.9599895Z ) -> None: 2025-05-07T20:32:40.9600111Z torch.manual_seed(2025) 2025-05-07T20:32:40.9600355Z 2025-05-07T20:32:40.9600625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9600985Z 2025-05-07T20:32:40.9601172Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9601462Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9601784Z x = x_sign * x_clamp 2025-05-07T20:32:40.9602024Z x0 = x[:, :D] 2025-05-07T20:32:40.9602234Z x1 = x[:, D:] 2025-05-07T20:32:40.9602439Z 2025-05-07T20:32:40.9602622Z if contiguous: 2025-05-07T20:32:40.9602848Z x0 = x0.contiguous() 2025-05-07T20:32:40.9603108Z x1 = x1.contiguous() 2025-05-07T20:32:40.9603351Z 2025-05-07T20:32:40.9603539Z if scale_ub is not None: 2025-05-07T20:32:40.9603814Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9604154Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9604477Z ) 2025-05-07T20:32:40.9604665Z else: 2025-05-07T20:32:40.9604866Z scale_ub_tensor = None 2025-05-07T20:32:40.9605116Z 2025-05-07T20:32:40.9605347Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9605670Z op = silu_mul_quant 2025-05-07T20:32:40.9605926Z if compiled: 2025-05-07T20:32:40.9606170Z op = torch.compile(op) 2025-05-07T20:32:40.9606466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9606744Z 2025-05-07T20:32:40.9606928Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9607095Z 2025-05-07T20:32:40.9607193Z moe/activation_test.py:117: 2025-05-07T20:32:40.9607496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9607844Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9608213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9608949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9609687Z 
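Note on the repeated failure above: this is an architecture mismatch, not a flaw in the test logic. The _fbgemm_silu_mul_quant kernel asks Triton for the fp8e4nv (e4m3) dtype, which Triton only lowers on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); on an SM 8.x part such as the A10G that g5 runners typically carry, only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError reports. A minimal sketch of a capability gate that would skip rather than fail these cases (the class name and placement are assumptions, not taken from the log):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (the e4m3 variant behind torch.float8_e4m3fn) needs
        # compute capability >= 8.9; fp8e4b15 / fp8e5 are the SM 8.x options.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard; apply to the test class that owns test_silu_mul_quant.
    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):
        ...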
2025-05-07T20:32:40.9624723Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:41.0830554Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:41.2634174Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
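Since every drawn example hits the same compile error, one pinned case is enough for a local repro; Hypothesis's @example decorator forces a specific draw to run first. A sketch under the strategies shown in the log (the test is flattened to a plain function here; the real one is a method taking self):

    from hypothesis import example, given, settings, strategies as st

    @example(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  # failing draw from this log
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
        ...  # same body as printed in the log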
2025-05-07T20:32:41.2668681Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:41.2682403Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB; 28.44 MiB free)
2025-05-07T20:32:41.2702413Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92, x = torch.randn([T, 2 * D], ...) (tried to allocate 448.00 MiB; 140.44 MiB free)
2025-05-07T20:32:41.3765589Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB; 28.44 MiB free)
2025-05-07T20:32:41.3779140Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94, x_sign = torch.sign(x) (tried to allocate 56.00 MiB; 28.44 MiB free)
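The OutOfMemoryError examples above are consistent with the test's own allocations on this ~22 GiB device rather than a kernel leak: x = torch.randn([T, 2 * D], dtype=torch.bfloat16) costs T * 2D * 2 bytes, which for T=16384, D=7168 is exactly the 448.00 MiB that fails at line 92, and each derived tensor costs the same again (56.00 MiB at T=2048, D=7168, matching the failures at lines 94-95); with 21.9+ GiB already held from earlier examples, even small requests fail. Beyond the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint the allocator prints, a common mitigation is to return cached blocks between examples; a sketch (placement in a setup/teardown hook is an assumption):

    import os

    # Must be set before the process first initializes CUDA; in CI it
    # belongs in the job environment rather than the test module.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def _release_cuda_cache() -> None:
        # Hand cached-but-unallocated blocks back to the driver so one large
        # Hypothesis example cannot starve the draws that follow it.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()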
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.3792121Z 2025-05-07T20:32:41.3792237Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:41.3792453Z 2025-05-07T20:32:41.3792558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3792975Z self=, 2025-05-07T20:32:41.3793390Z T=1, 2025-05-07T20:32:41.3793566Z D=7168, 2025-05-07T20:32:41.3793746Z scale_ub=1200.0, 2025-05-07T20:32:41.3794085Z contiguous=True, 2025-05-07T20:32:41.3794302Z compiled=False, 2025-05-07T20:32:41.3794500Z ) 2025-05-07T20:32:41.5358626Z self = 2025-05-07T20:32:41.5359449Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.5359828Z 2025-05-07T20:32:41.5359935Z @given( 2025-05-07T20:32:41.5360248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.5360572Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.5360877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.5361216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.5361556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.5361850Z ) 2025-05-07T20:32:41.5362209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.5362679Z def test_silu_mul_quant( 2025-05-07T20:32:41.5362937Z self, 2025-05-07T20:32:41.5363129Z T: int, 2025-05-07T20:32:41.5363331Z D: int, 2025-05-07T20:32:41.5363549Z scale_ub: Optional[float], 2025-05-07T20:32:41.5363817Z contiguous: bool, 2025-05-07T20:32:41.5364060Z compiled: bool, 2025-05-07T20:32:41.5364284Z ) -> None: 2025-05-07T20:32:41.5364491Z torch.manual_seed(2025) 2025-05-07T20:32:41.5364748Z 2025-05-07T20:32:41.5365160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.5365510Z 2025-05-07T20:32:41.5365694Z x_sign = torch.sign(x) 2025-05-07T20:32:41.5365983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.5366302Z x = x_sign * x_clamp 2025-05-07T20:32:41.5366624Z x0 = x[:, :D] 2025-05-07T20:32:41.5366834Z x1 = x[:, D:] 2025-05-07T20:32:41.5367036Z 2025-05-07T20:32:41.5367216Z if contiguous: 2025-05-07T20:32:41.5367443Z x0 = x0.contiguous() 2025-05-07T20:32:41.5367705Z x1 = x1.contiguous() 2025-05-07T20:32:41.5367937Z 2025-05-07T20:32:41.5368121Z if scale_ub is not None: 2025-05-07T20:32:41.5368392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.5368727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.5369041Z ) 2025-05-07T20:32:41.5369231Z else: 2025-05-07T20:32:41.5369433Z scale_ub_tensor = None 2025-05-07T20:32:41.5369683Z 2025-05-07T20:32:41.5369908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.5370222Z op = silu_mul_quant 2025-05-07T20:32:41.5370470Z if compiled: 2025-05-07T20:32:41.5370713Z op = torch.compile(op) 2025-05-07T20:32:41.5371027Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5371305Z 2025-05-07T20:32:41.5371489Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.5371656Z 2025-05-07T20:32:41.5371752Z moe/activation_test.py:117: 2025-05-07T20:32:41.5372051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5372387Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.5372668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5373400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.5374138Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.5374701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.5375426Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.5376130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.5376688Z kernel = self.compile( 2025-05-07T20:32:41.5377384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.5378083Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.5378487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5378726Z 2025-05-07T20:32:41.5378935Z self = 2025-05-07T20:32:41.5380100Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.5381600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd35eb0d0>} 2025-05-07T20:32:41.5383250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.5384354Z context = 2025-05-07T20:32:41.5384659Z 2025-05-07T20:32:41.5384828Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.5385374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.5385933Z module_map=module_map) 2025-05-07T20:32:41.5386299Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.5386656Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.5386915Z E ^ 2025-05-07T20:32:41.5387402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.5387956Z 2025-05-07T20:32:41.5388401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.5388961Z 2025-05-07T20:32:41.5389059Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5389481Z self=, 2025-05-07T20:32:41.5390019Z T=128, 2025-05-07T20:32:41.5390201Z D=5120, 2025-05-07T20:32:41.5390385Z scale_ub=None, 2025-05-07T20:32:41.5390590Z contiguous=True, 2025-05-07T20:32:41.5390806Z compiled=False, 2025-05-07T20:32:41.5391005Z ) 2025-05-07T20:32:41.5391316Z self = 2025-05-07T20:32:41.5391829Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.5392109Z 2025-05-07T20:32:41.5392185Z @given( 2025-05-07T20:32:41.5392411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.5392724Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.5393033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.5393374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.5393706Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.5393995Z ) 2025-05-07T20:32:41.5394351Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.5394806Z def test_silu_mul_quant( 2025-05-07T20:32:41.5395044Z self, 2025-05-07T20:32:41.5395233Z T: int, 2025-05-07T20:32:41.5395427Z D: int, 2025-05-07T20:32:41.5395634Z scale_ub: Optional[float], 2025-05-07T20:32:41.5395904Z contiguous: bool, 2025-05-07T20:32:41.5396139Z compiled: bool, 2025-05-07T20:32:41.5396348Z ) -> None: 2025-05-07T20:32:41.5396556Z torch.manual_seed(2025) 2025-05-07T20:32:41.5396793Z 2025-05-07T20:32:41.5397059Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.5397410Z 2025-05-07T20:32:41.5397596Z x_sign = torch.sign(x) 2025-05-07T20:32:41.5398012Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.5398329Z x = x_sign * x_clamp 2025-05-07T20:32:41.5398572Z x0 = x[:, :D] 2025-05-07T20:32:41.5398781Z x1 = x[:, D:] 2025-05-07T20:32:41.5398981Z 2025-05-07T20:32:41.5399162Z if contiguous: 2025-05-07T20:32:41.5399383Z x0 = x0.contiguous() 2025-05-07T20:32:41.5399639Z x1 = x1.contiguous() 2025-05-07T20:32:41.5399880Z 2025-05-07T20:32:41.5400066Z if scale_ub is not None: 2025-05-07T20:32:41.5400342Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.5400682Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.5401004Z ) 2025-05-07T20:32:41.5401191Z else: 2025-05-07T20:32:41.5401399Z scale_ub_tensor = None 2025-05-07T20:32:41.5401663Z 2025-05-07T20:32:41.5401887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.5402210Z op = silu_mul_quant 2025-05-07T20:32:41.5402469Z if compiled: 2025-05-07T20:32:41.5402712Z op = torch.compile(op) 2025-05-07T20:32:41.5403015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5403296Z 2025-05-07T20:32:41.5403481Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.5403650Z 2025-05-07T20:32:41.5403745Z moe/activation_test.py:117: 2025-05-07T20:32:41.5404042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.5404436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.5404714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.5405449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.5406195Z 
_fbgemm_silu_mul_quant[grid](
    (same Triton jit.py / compiler.py frames as in the traceback above, ending in the identical error)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
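This ValueError is what Triton raises when a kernel uses the fp8e4nv (FP8 E4M3) type on a GPU whose compute capability is below 8.9; per the message, only the fp8e4b15 and fp8e5 encodings are lowered on this architecture. A minimal sketch of a capability guard that would skip these cases on such hardware; the names gpu_supports_fp8e4nv and ActivationFp8Tests are illustrative assumptions, not FBGEMM's actual test plumbing:

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (e4m3) only on SM 8.9+ (Ada/Hopper); older
    # parts raise the ValueError seen above at kernel-compile time.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class ActivationFp8Tests(unittest.TestCase):
    ...

Skipping at collection time would also avoid the cascading out-of-memory failures recorded below, since no bf16 inputs would be allocated on unsupported runners.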
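For reference while reading the test source above: judging only from the call op(x0, x1, scale_ub_tensor) returning (y_fp8, y_scale), the op plausibly computes silu(x0) * x1 followed by rowwise FP8 quantization with an optional upper bound on the scale. The eager-mode sketch below is an inference from the test code, not FBGEMM's actual kernel:

from typing import Optional, Tuple

import torch
import torch.nn.functional as F

def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1, computed in fp32 for accuracy.
    y = F.silu(x0.float()) * x1.float()
    # Rowwise absmax scale, optionally clamped from above by scale_ub.
    amax = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.float())
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = amax / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale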
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.5419765Z 2025-05-07T20:32:41.5420214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.5420774Z 2025-05-07T20:32:41.5420875Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.5421306Z self=, 2025-05-07T20:32:41.5421725Z T=128, 2025-05-07T20:32:41.5421906Z D=7168, 2025-05-07T20:32:41.5422094Z scale_ub=None, 2025-05-07T20:32:41.5422306Z contiguous=True, 2025-05-07T20:32:41.5422522Z compiled=False, 2025-05-07T20:32:41.5422726Z ) 2025-05-07T20:32:41.6322897Z self = 2025-05-07T20:32:41.6324409Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.6325205Z 2025-05-07T20:32:41.6325402Z @given( 2025-05-07T20:32:41.6325847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6326312Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6326620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6326958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6327292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6327696Z ) 2025-05-07T20:32:41.6328048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6328511Z def test_silu_mul_quant( 2025-05-07T20:32:41.6328749Z self, 2025-05-07T20:32:41.6328932Z T: int, 2025-05-07T20:32:41.6329125Z D: int, 2025-05-07T20:32:41.6329413Z scale_ub: Optional[float], 2025-05-07T20:32:41.6329684Z contiguous: bool, 2025-05-07T20:32:41.6329922Z compiled: bool, 2025-05-07T20:32:41.6330146Z ) -> None: 2025-05-07T20:32:41.6330361Z torch.manual_seed(2025) 2025-05-07T20:32:41.6330609Z 2025-05-07T20:32:41.6330882Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6331228Z 2025-05-07T20:32:41.6331413Z x_sign = torch.sign(x) 2025-05-07T20:32:41.6331700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.6332009Z x = x_sign * x_clamp 2025-05-07T20:32:41.6332250Z x0 = x[:, :D] 2025-05-07T20:32:41.6332463Z x1 = x[:, D:] 2025-05-07T20:32:41.6332667Z 2025-05-07T20:32:41.6332845Z if contiguous: 2025-05-07T20:32:41.6333072Z x0 = x0.contiguous() 2025-05-07T20:32:41.6333328Z x1 = x1.contiguous() 2025-05-07T20:32:41.6333562Z 2025-05-07T20:32:41.6333748Z if scale_ub is not None: 2025-05-07T20:32:41.6334022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.6334357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.6334668Z ) 2025-05-07T20:32:41.6334858Z else: 2025-05-07T20:32:41.6335057Z scale_ub_tensor = None 2025-05-07T20:32:41.6335309Z 2025-05-07T20:32:41.6335536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.6335848Z op = silu_mul_quant 2025-05-07T20:32:41.6336102Z if compiled: 2025-05-07T20:32:41.6336343Z op = torch.compile(op) 2025-05-07T20:32:41.6336643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6336918Z 2025-05-07T20:32:41.6337105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.6337271Z 2025-05-07T20:32:41.6337369Z moe/activation_test.py:117: 2025-05-07T20:32:41.6337662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6338004Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.6338283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6339164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.6339913Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.6340476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.6341203Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.6341901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.6342465Z kernel = self.compile( 2025-05-07T20:32:41.6343033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.6343722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.6344130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6344374Z 2025-05-07T20:32:41.6344592Z self = 2025-05-07T20:32:41.6345763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.6347262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd36f0550>} 2025-05-07T20:32:41.6348771Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.6349987Z context = 2025-05-07T20:32:41.6350337Z 2025-05-07T20:32:41.6350504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.6351045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.6351526Z module_map=module_map) 2025-05-07T20:32:41.6351903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.6352261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.6352515Z E ^ 2025-05-07T20:32:41.6353007Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.6353500Z 2025-05-07T20:32:41.6353949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.6354503Z 2025-05-07T20:32:41.6354606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6355027Z self=, 2025-05-07T20:32:41.6355448Z T=2048, 2025-05-07T20:32:41.6355639Z D=7168, 2025-05-07T20:32:41.6361945Z scale_ub=1200.0, 2025-05-07T20:32:41.6362196Z contiguous=True, 2025-05-07T20:32:41.6362425Z compiled=False, 2025-05-07T20:32:41.6362638Z ) 2025-05-07T20:32:41.6362966Z self = 2025-05-07T20:32:41.6363494Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.6363791Z 2025-05-07T20:32:41.6363867Z @given( 2025-05-07T20:32:41.6364099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6364413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6364725Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6365064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6365399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6365694Z ) 2025-05-07T20:32:41.6366054Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6366515Z def test_silu_mul_quant( 2025-05-07T20:32:41.6366867Z self, 2025-05-07T20:32:41.6367063Z T: int, 2025-05-07T20:32:41.6367256Z D: int, 2025-05-07T20:32:41.6367473Z scale_ub: Optional[float], 2025-05-07T20:32:41.6367749Z contiguous: bool, 2025-05-07T20:32:41.6367988Z compiled: bool, 2025-05-07T20:32:41.6368209Z ) -> None: 2025-05-07T20:32:41.6368423Z torch.manual_seed(2025) 2025-05-07T20:32:41.6368672Z 2025-05-07T20:32:41.6368948Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6371217Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.6373280Z 2025-05-07T20:32:41.6373400Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.6373622Z 2025-05-07T20:32:41.6373723Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6374146Z self=, 2025-05-07T20:32:41.6374605Z T=1, 2025-05-07T20:32:41.6374785Z D=5120, 2025-05-07T20:32:41.6374973Z scale_ub=1200.0, 2025-05-07T20:32:41.6375191Z contiguous=True, 2025-05-07T20:32:41.6375404Z compiled=False, 2025-05-07T20:32:41.6375603Z ) 2025-05-07T20:32:41.6855287Z self = 2025-05-07T20:32:41.6856650Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.6857032Z 2025-05-07T20:32:41.6857146Z @given( 2025-05-07T20:32:41.6857438Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6857769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6858081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6858416Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6858758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6859059Z ) 2025-05-07T20:32:41.6859420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6859884Z def test_silu_mul_quant( 2025-05-07T20:32:41.6860125Z self, 2025-05-07T20:32:41.6860318Z T: int, 2025-05-07T20:32:41.6860514Z D: int, 2025-05-07T20:32:41.6860735Z scale_ub: Optional[float], 2025-05-07T20:32:41.6861014Z contiguous: bool, 2025-05-07T20:32:41.6861256Z compiled: bool, 2025-05-07T20:32:41.6861484Z ) -> None: 2025-05-07T20:32:41.6861702Z torch.manual_seed(2025) 2025-05-07T20:32:41.6861941Z 2025-05-07T20:32:41.6862220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6862577Z 2025-05-07T20:32:41.6862768Z x_sign = torch.sign(x) 2025-05-07T20:32:41.6863065Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.6863380Z x = x_sign * x_clamp 2025-05-07T20:32:41.6863621Z x0 = x[:, :D] 2025-05-07T20:32:41.6863841Z x1 = x[:, D:] 2025-05-07T20:32:41.6864049Z 2025-05-07T20:32:41.6864235Z if contiguous: 2025-05-07T20:32:41.6864466Z x0 = x0.contiguous() 2025-05-07T20:32:41.6864723Z x1 = x1.contiguous() 2025-05-07T20:32:41.6864969Z 2025-05-07T20:32:41.6865153Z if scale_ub is not None: 2025-05-07T20:32:41.6865434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.6865778Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.6866098Z ) 2025-05-07T20:32:41.6866297Z else: 2025-05-07T20:32:41.6866508Z scale_ub_tensor = None 2025-05-07T20:32:41.6866898Z 2025-05-07T20:32:41.6867125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.6867444Z op = silu_mul_quant 2025-05-07T20:32:41.6867686Z if compiled: 2025-05-07T20:32:41.6867926Z op = torch.compile(op) 2025-05-07T20:32:41.6868220Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6868491Z 2025-05-07T20:32:41.6868674Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.6868845Z 2025-05-07T20:32:41.6868940Z moe/activation_test.py:117: 2025-05-07T20:32:41.6869238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6869574Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.6869981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.6870723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.6871456Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.6872022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.6872750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.6873452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.6874004Z kernel = self.compile( 2025-05-07T20:32:41.6874638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.6875331Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.6875735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.6876021Z 2025-05-07T20:32:41.6876231Z self = 2025-05-07T20:32:41.6877406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.6878905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd37d5280>} 2025-05-07T20:32:41.6880367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.6881466Z context = 2025-05-07T20:32:41.6881770Z 2025-05-07T20:32:41.6881934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.6882484Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.6883154Z module_map=module_map) 2025-05-07T20:32:41.6883520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.6883873Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.6884130Z E ^ 2025-05-07T20:32:41.6884613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.6885102Z 2025-05-07T20:32:41.6885546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.6886104Z 2025-05-07T20:32:41.6886203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6886622Z self=, 2025-05-07T20:32:41.6887038Z T=2048, 2025-05-07T20:32:41.6887226Z D=5120, 2025-05-07T20:32:41.6887410Z scale_ub=None, 2025-05-07T20:32:41.6887616Z contiguous=True, 2025-05-07T20:32:41.6887834Z compiled=False, 2025-05-07T20:32:41.6888028Z ) 2025-05-07T20:32:41.6888462Z self = 2025-05-07T20:32:41.6888977Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.6889265Z 2025-05-07T20:32:41.6889339Z @given( 2025-05-07T20:32:41.6889561Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6889875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6890186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6890524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6890852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6891149Z ) 2025-05-07T20:32:41.6891507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6891961Z def test_silu_mul_quant( 2025-05-07T20:32:41.6892195Z self, 2025-05-07T20:32:41.6892382Z T: int, 2025-05-07T20:32:41.6892563Z D: int, 2025-05-07T20:32:41.6892782Z scale_ub: Optional[float], 2025-05-07T20:32:41.6893050Z contiguous: bool, 2025-05-07T20:32:41.6893287Z compiled: bool, 2025-05-07T20:32:41.6893509Z ) -> None: 2025-05-07T20:32:41.6893717Z torch.manual_seed(2025) 2025-05-07T20:32:41.6893961Z 2025-05-07T20:32:41.6894226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6894572Z 2025-05-07T20:32:41.6894867Z > x_sign = torch.sign(x) 2025-05-07T20:32:41.6897001Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.6899129Z 2025-05-07T20:32:41.6899244Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:41.6899467Z 2025-05-07T20:32:41.6899564Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6899985Z self=, 2025-05-07T20:32:41.6900397Z T=16384, 2025-05-07T20:32:41.6900582Z D=5120, 2025-05-07T20:32:41.6900765Z scale_ub=None, 2025-05-07T20:32:41.6900972Z contiguous=True, 2025-05-07T20:32:41.6901182Z compiled=False, 2025-05-07T20:32:41.6901380Z ) 2025-05-07T20:32:41.6901697Z self = 2025-05-07T20:32:41.6902207Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.6902504Z 2025-05-07T20:32:41.6902579Z @given( 2025-05-07T20:32:41.6902798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.6903109Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.6903419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.6903752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.6904081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.6904367Z ) 2025-05-07T20:32:41.6904716Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.6905175Z def test_silu_mul_quant( 2025-05-07T20:32:41.6905405Z self, 2025-05-07T20:32:41.6905592Z T: int, 2025-05-07T20:32:41.6905782Z D: int, 2025-05-07T20:32:41.6905987Z scale_ub: Optional[float], 2025-05-07T20:32:41.6906258Z contiguous: bool, 2025-05-07T20:32:41.6906520Z compiled: bool, 2025-05-07T20:32:41.6906763Z ) -> None: 2025-05-07T20:32:41.6906970Z torch.manual_seed(2025) 2025-05-07T20:32:41.6907203Z 2025-05-07T20:32:41.6907466Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.6909878Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.6911949Z 2025-05-07T20:32:41.6912064Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.6912285Z 2025-05-07T20:32:41.6912382Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.6912803Z self=, 2025-05-07T20:32:41.6913217Z T=4096, 2025-05-07T20:32:41.6913402Z D=5120, 2025-05-07T20:32:41.6913585Z scale_ub=None, 2025-05-07T20:32:41.6913788Z contiguous=True, 2025-05-07T20:32:41.6914003Z compiled=False, 2025-05-07T20:32:41.6914200Z ) 2025-05-07T20:32:41.9884362Z self = 2025-05-07T20:32:41.9885214Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.9885619Z 2025-05-07T20:32:41.9885729Z @given( 2025-05-07T20:32:41.9886100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9886435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9886770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9887101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9887431Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9887788Z ) 2025-05-07T20:32:41.9888138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9888597Z def test_silu_mul_quant( 2025-05-07T20:32:41.9888838Z self, 2025-05-07T20:32:41.9889025Z T: int, 2025-05-07T20:32:41.9889218Z D: int, 2025-05-07T20:32:41.9889424Z scale_ub: Optional[float], 2025-05-07T20:32:41.9889698Z contiguous: bool, 2025-05-07T20:32:41.9889933Z compiled: bool, 2025-05-07T20:32:41.9890151Z ) -> None: 2025-05-07T20:32:41.9890362Z torch.manual_seed(2025) 2025-05-07T20:32:41.9890606Z 2025-05-07T20:32:41.9890870Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9893114Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9895173Z 2025-05-07T20:32:41.9895288Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9895507Z 2025-05-07T20:32:41.9895606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9896026Z self=, 2025-05-07T20:32:41.9896441Z T=2048, 2025-05-07T20:32:41.9896619Z D=5120, 2025-05-07T20:32:41.9896801Z scale_ub=None, 2025-05-07T20:32:41.9897009Z contiguous=False, 2025-05-07T20:32:41.9897232Z compiled=False, 2025-05-07T20:32:41.9897427Z ) 2025-05-07T20:32:41.9897741Z self = 2025-05-07T20:32:41.9898255Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:41.9898544Z 2025-05-07T20:32:41.9898617Z @given( 2025-05-07T20:32:41.9898957Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9899269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9899576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9899911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9900236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9900523Z ) 2025-05-07T20:32:41.9900882Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9901340Z def test_silu_mul_quant( 2025-05-07T20:32:41.9901577Z self, 2025-05-07T20:32:41.9901764Z T: int, 2025-05-07T20:32:41.9901955Z D: int, 2025-05-07T20:32:41.9902160Z scale_ub: Optional[float], 2025-05-07T20:32:41.9902430Z contiguous: bool, 2025-05-07T20:32:41.9902667Z compiled: bool, 2025-05-07T20:32:41.9902879Z ) -> None: 2025-05-07T20:32:41.9903091Z torch.manual_seed(2025) 2025-05-07T20:32:41.9903330Z 2025-05-07T20:32:41.9903601Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9905828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9908003Z 2025-05-07T20:32:41.9908121Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9908384Z 2025-05-07T20:32:41.9908482Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9908904Z self=, 2025-05-07T20:32:41.9909316Z T=4096, 2025-05-07T20:32:41.9909492Z D=7168, 2025-05-07T20:32:41.9909676Z scale_ub=None, 2025-05-07T20:32:41.9910013Z contiguous=True, 2025-05-07T20:32:41.9910232Z compiled=True, 2025-05-07T20:32:41.9910429Z ) 2025-05-07T20:32:41.9910742Z self = 2025-05-07T20:32:41.9911252Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.9911535Z 2025-05-07T20:32:41.9911612Z @given( 2025-05-07T20:32:41.9911826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9912141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9912447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9912781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9913109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9913395Z ) 2025-05-07T20:32:41.9913752Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9914210Z def test_silu_mul_quant( 2025-05-07T20:32:41.9914451Z self, 2025-05-07T20:32:41.9914636Z T: int, 2025-05-07T20:32:41.9914821Z D: int, 2025-05-07T20:32:41.9915032Z scale_ub: Optional[float], 2025-05-07T20:32:41.9915300Z contiguous: bool, 2025-05-07T20:32:41.9915530Z compiled: bool, 2025-05-07T20:32:41.9915745Z ) -> None: 2025-05-07T20:32:41.9915958Z torch.manual_seed(2025) 2025-05-07T20:32:41.9916192Z 2025-05-07T20:32:41.9916459Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9918767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9920826Z 2025-05-07T20:32:41.9920941Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9921160Z 2025-05-07T20:32:41.9921263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9921679Z self=, 2025-05-07T20:32:41.9922093Z T=2048, 2025-05-07T20:32:41.9922272Z D=5120, 2025-05-07T20:32:41.9922454Z scale_ub=1200.0, 2025-05-07T20:32:41.9922673Z contiguous=False, 2025-05-07T20:32:41.9922892Z compiled=False, 2025-05-07T20:32:41.9923085Z ) 2025-05-07T20:32:41.9923408Z self = 2025-05-07T20:32:41.9923921Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.9924211Z 2025-05-07T20:32:41.9924292Z @given( 2025-05-07T20:32:41.9924511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9924825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9925134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9925463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9925799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9926134Z ) 2025-05-07T20:32:41.9926483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9926938Z def test_silu_mul_quant( 2025-05-07T20:32:41.9927175Z self, 2025-05-07T20:32:41.9927365Z T: int, 2025-05-07T20:32:41.9927559Z D: int, 2025-05-07T20:32:41.9927811Z scale_ub: Optional[float], 2025-05-07T20:32:41.9928077Z contiguous: bool, 2025-05-07T20:32:41.9928305Z compiled: bool, 2025-05-07T20:32:41.9928520Z ) -> None: 2025-05-07T20:32:41.9928734Z torch.manual_seed(2025) 2025-05-07T20:32:41.9928971Z 2025-05-07T20:32:41.9929234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9931451Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9933503Z 2025-05-07T20:32:41.9933620Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9933834Z 2025-05-07T20:32:41.9933934Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9934352Z self=, 2025-05-07T20:32:41.9934766Z T=4096, 2025-05-07T20:32:41.9934951Z D=7168, 2025-05-07T20:32:41.9935134Z scale_ub=1200.0, 2025-05-07T20:32:41.9935347Z contiguous=True, 2025-05-07T20:32:41.9935560Z compiled=False, 2025-05-07T20:32:41.9935755Z ) 2025-05-07T20:32:41.9936075Z self = 2025-05-07T20:32:41.9936633Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.9936935Z 2025-05-07T20:32:41.9937009Z @given( 2025-05-07T20:32:41.9937229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9937543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9937851Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9938179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9938511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9938797Z ) 2025-05-07T20:32:41.9939230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9939688Z def test_silu_mul_quant( 2025-05-07T20:32:41.9939926Z self, 2025-05-07T20:32:41.9940108Z T: int, 2025-05-07T20:32:41.9940295Z D: int, 2025-05-07T20:32:41.9940504Z scale_ub: Optional[float], 2025-05-07T20:32:41.9940768Z contiguous: bool, 2025-05-07T20:32:41.9941007Z compiled: bool, 2025-05-07T20:32:41.9941221Z ) -> None: 2025-05-07T20:32:41.9941421Z torch.manual_seed(2025) 2025-05-07T20:32:41.9941664Z 2025-05-07T20:32:41.9941929Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9944161Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9946219Z 2025-05-07T20:32:41.9946338Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9946598Z 2025-05-07T20:32:41.9946696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9947116Z self=, 2025-05-07T20:32:41.9947532Z T=16384, 2025-05-07T20:32:41.9947716Z D=7168, 2025-05-07T20:32:41.9947898Z scale_ub=None, 2025-05-07T20:32:41.9948103Z contiguous=False, 2025-05-07T20:32:41.9948360Z compiled=True, 2025-05-07T20:32:41.9948558Z ) 2025-05-07T20:32:42.1256846Z self = 2025-05-07T20:32:42.1257592Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1258044Z 2025-05-07T20:32:42.1258153Z @given( 2025-05-07T20:32:42.1258461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1258851Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1259166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1259508Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1259841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1260137Z ) 2025-05-07T20:32:42.1260503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1260963Z def test_silu_mul_quant( 2025-05-07T20:32:42.1261209Z self, 2025-05-07T20:32:42.1261405Z T: int, 2025-05-07T20:32:42.1261611Z D: int, 2025-05-07T20:32:42.1261824Z scale_ub: Optional[float], 2025-05-07T20:32:42.1262094Z contiguous: bool, 2025-05-07T20:32:42.1262336Z compiled: bool, 2025-05-07T20:32:42.1262563Z ) -> None: 2025-05-07T20:32:42.1262780Z torch.manual_seed(2025) 2025-05-07T20:32:42.1263020Z 2025-05-07T20:32:42.1263298Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1265546Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1267661Z 2025-05-07T20:32:42.1267781Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1268002Z 2025-05-07T20:32:42.1268276Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1268705Z self=, 2025-05-07T20:32:42.1269125Z T=4096, 2025-05-07T20:32:42.1269313Z D=7168, 2025-05-07T20:32:42.1275556Z scale_ub=None, 2025-05-07T20:32:42.1275819Z contiguous=True, 2025-05-07T20:32:42.1276052Z compiled=False, 2025-05-07T20:32:42.1276266Z ) 2025-05-07T20:32:42.1276606Z self = 2025-05-07T20:32:42.1277145Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1277438Z 2025-05-07T20:32:42.1277516Z @given( 2025-05-07T20:32:42.1277750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1278074Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1278389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1278730Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1279078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1279373Z ) 2025-05-07T20:32:42.1279733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1280200Z def test_silu_mul_quant( 2025-05-07T20:32:42.1280440Z self, 2025-05-07T20:32:42.1280631Z T: int, 2025-05-07T20:32:42.1280825Z D: int, 2025-05-07T20:32:42.1281041Z scale_ub: Optional[float], 2025-05-07T20:32:42.1281420Z contiguous: bool, 2025-05-07T20:32:42.1281662Z compiled: bool, 2025-05-07T20:32:42.1281892Z ) -> None: 2025-05-07T20:32:42.1282105Z torch.manual_seed(2025) 2025-05-07T20:32:42.1282357Z 2025-05-07T20:32:42.1282644Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1285258Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1287319Z 2025-05-07T20:32:42.1287452Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1287677Z 2025-05-07T20:32:42.1287783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1288214Z self=, 2025-05-07T20:32:42.1288646Z T=16384, 2025-05-07T20:32:42.1288843Z D=7168, 2025-05-07T20:32:42.1289053Z scale_ub=None, 2025-05-07T20:32:42.1289276Z contiguous=True, 2025-05-07T20:32:42.1289506Z compiled=False, 2025-05-07T20:32:42.1289715Z ) 2025-05-07T20:32:42.1290043Z self = 2025-05-07T20:32:42.1290575Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1290869Z 2025-05-07T20:32:42.1290964Z @given( 2025-05-07T20:32:42.1291199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1291526Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1291850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1292192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1292534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1292836Z ) 2025-05-07T20:32:42.1293203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1293670Z def test_silu_mul_quant( 2025-05-07T20:32:42.1293923Z self, 2025-05-07T20:32:42.1294120Z T: int, 2025-05-07T20:32:42.1294312Z D: int, 2025-05-07T20:32:42.1294531Z scale_ub: Optional[float], 2025-05-07T20:32:42.1294974Z contiguous: bool, 2025-05-07T20:32:42.1295212Z compiled: bool, 2025-05-07T20:32:42.1295440Z ) -> None: 2025-05-07T20:32:42.1295653Z torch.manual_seed(2025) 2025-05-07T20:32:42.1295889Z 2025-05-07T20:32:42.1296162Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1298399Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1300468Z 2025-05-07T20:32:42.1300584Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1300807Z 2025-05-07T20:32:42.1300911Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1301330Z self=, 2025-05-07T20:32:42.1301751Z T=16384, 2025-05-07T20:32:42.1301940Z D=7168, 2025-05-07T20:32:42.1302123Z scale_ub=1200.0, 2025-05-07T20:32:42.1302341Z contiguous=True, 2025-05-07T20:32:42.1302566Z compiled=False, 2025-05-07T20:32:42.1302823Z ) 2025-05-07T20:32:42.1303144Z self = 2025-05-07T20:32:42.1303661Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.1303957Z 2025-05-07T20:32:42.1304038Z @given( 2025-05-07T20:32:42.1304258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1304643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1304958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1305299Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1305640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1305930Z ) 2025-05-07T20:32:42.1306283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1306746Z def test_silu_mul_quant( 2025-05-07T20:32:42.1306986Z self, 2025-05-07T20:32:42.1307173Z T: int, 2025-05-07T20:32:42.1307370Z D: int, 2025-05-07T20:32:42.1307586Z scale_ub: Optional[float], 2025-05-07T20:32:42.1307852Z contiguous: bool, 2025-05-07T20:32:42.1308091Z compiled: bool, 2025-05-07T20:32:42.1308311Z ) -> None: 2025-05-07T20:32:42.1308527Z torch.manual_seed(2025) 2025-05-07T20:32:42.1308766Z 2025-05-07T20:32:42.1309041Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1311396Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.1313460Z 2025-05-07T20:32:42.1313579Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.1313796Z 2025-05-07T20:32:42.1313900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1314323Z self=, 2025-05-07T20:32:42.1314746Z T=128, 2025-05-07T20:32:42.1314924Z D=5120, 2025-05-07T20:32:42.1315104Z scale_ub=1200.0, 2025-05-07T20:32:42.1315326Z contiguous=False, 2025-05-07T20:32:42.1315549Z compiled=False, 2025-05-07T20:32:42.1315743Z ) 2025-05-07T20:32:42.2946061Z self = 2025-05-07T20:32:42.2947034Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.2947384Z 2025-05-07T20:32:42.2947471Z @given( 2025-05-07T20:32:42.2947698Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2948023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2948339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2948682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2949017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2949311Z ) 2025-05-07T20:32:42.2949669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2950233Z def test_silu_mul_quant( 2025-05-07T20:32:42.2950476Z self, 2025-05-07T20:32:42.2950673Z T: int, 2025-05-07T20:32:42.2950867Z D: int, 2025-05-07T20:32:42.2951092Z scale_ub: Optional[float], 2025-05-07T20:32:42.2951368Z contiguous: bool, 2025-05-07T20:32:42.2951603Z compiled: bool, 2025-05-07T20:32:42.2951831Z ) -> None: 2025-05-07T20:32:42.2952051Z torch.manual_seed(2025) 2025-05-07T20:32:42.2952293Z 2025-05-07T20:32:42.2952566Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2952996Z 2025-05-07T20:32:42.2953179Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2953474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2953789Z x = x_sign * x_clamp 2025-05-07T20:32:42.2954025Z x0 = x[:, :D] 2025-05-07T20:32:42.2954230Z x1 = x[:, D:] 2025-05-07T20:32:42.2954431Z 2025-05-07T20:32:42.2954685Z if contiguous: 2025-05-07T20:32:42.2954906Z x0 = x0.contiguous() 2025-05-07T20:32:42.2955164Z x1 = x1.contiguous() 2025-05-07T20:32:42.2955400Z 2025-05-07T20:32:42.2955583Z if scale_ub is not None: 2025-05-07T20:32:42.2955860Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2956203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2956519Z ) 2025-05-07T20:32:42.2956708Z else: 2025-05-07T20:32:42.2956920Z scale_ub_tensor = None 2025-05-07T20:32:42.2957166Z 2025-05-07T20:32:42.2957394Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2957720Z op = silu_mul_quant 2025-05-07T20:32:42.2957961Z if compiled: 2025-05-07T20:32:42.2958207Z op = torch.compile(op) 2025-05-07T20:32:42.2958503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2958782Z 2025-05-07T20:32:42.2958962Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2959134Z 2025-05-07T20:32:42.2959228Z moe/activation_test.py:117: 2025-05-07T20:32:42.2959521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2959862Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2960142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2960881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2961623Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2962188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2962916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2963622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2964181Z kernel = self.compile( 2025-05-07T20:32:42.2964750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2965531Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2965941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2966179Z 2025-05-07T20:32:42.2966389Z self = 2025-05-07T20:32:42.2967558Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2969062Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd34a6940>} 2025-05-07T20:32:42.2970529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2971636Z context = 2025-05-07T20:32:42.2971941Z 2025-05-07T20:32:42.2972105Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2972653Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2973139Z module_map=module_map) 2025-05-07T20:32:42.2973508Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2973913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2974175Z E ^ 2025-05-07T20:32:42.2974654Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2975145Z 2025-05-07T20:32:42.2975589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2976184Z 2025-05-07T20:32:42.2976285Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2976712Z self=, 2025-05-07T20:32:42.2977131Z T=2048, 2025-05-07T20:32:42.2977312Z D=7168, 2025-05-07T20:32:42.2977505Z scale_ub=None, 2025-05-07T20:32:42.2977715Z contiguous=False, 2025-05-07T20:32:42.2977940Z compiled=False, 2025-05-07T20:32:42.2978141Z ) 2025-05-07T20:32:42.2978455Z self = 2025-05-07T20:32:42.2978968Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.2979261Z 2025-05-07T20:32:42.2979333Z @given( 2025-05-07T20:32:42.2979553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2979873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2980185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2980515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2980848Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2981138Z ) 2025-05-07T20:32:42.2981491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2981947Z def test_silu_mul_quant( 2025-05-07T20:32:42.2982182Z self, 2025-05-07T20:32:42.2982368Z T: int, 2025-05-07T20:32:42.2982552Z D: int, 2025-05-07T20:32:42.2982939Z scale_ub: Optional[float], 2025-05-07T20:32:42.2983212Z contiguous: bool, 2025-05-07T20:32:42.2983447Z compiled: bool, 2025-05-07T20:32:42.2983667Z ) -> None: 2025-05-07T20:32:42.2983884Z torch.manual_seed(2025) 2025-05-07T20:32:42.2984123Z 2025-05-07T20:32:42.2984398Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2986762Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2988809Z 2025-05-07T20:32:42.2988927Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.2989144Z 2025-05-07T20:32:42.2989248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2989664Z self=, 2025-05-07T20:32:42.2990187Z T=128, 2025-05-07T20:32:42.2990363Z D=7168, 2025-05-07T20:32:42.2990541Z scale_ub=1200.0, 2025-05-07T20:32:42.2990759Z contiguous=True, 2025-05-07T20:32:42.2990974Z compiled=True, 2025-05-07T20:32:42.2991166Z ) 2025-05-07T20:32:42.3449958Z self = 2025-05-07T20:32:42.3450750Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.3451143Z 2025-05-07T20:32:42.3451234Z @given( 2025-05-07T20:32:42.3451464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3451781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3452093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3452433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3452883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3453176Z ) 2025-05-07T20:32:42.3453534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3453997Z def test_silu_mul_quant( 2025-05-07T20:32:42.3454235Z self, 2025-05-07T20:32:42.3454490Z T: int, 2025-05-07T20:32:42.3454686Z D: int, 2025-05-07T20:32:42.3454898Z scale_ub: Optional[float], 2025-05-07T20:32:42.3455174Z contiguous: bool, 2025-05-07T20:32:42.3455419Z compiled: bool, 2025-05-07T20:32:42.3455636Z ) -> None: 2025-05-07T20:32:42.3455850Z torch.manual_seed(2025) 2025-05-07T20:32:42.3456095Z 2025-05-07T20:32:42.3456366Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3456722Z 2025-05-07T20:32:42.3456914Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3457205Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3457531Z x = x_sign * x_clamp 2025-05-07T20:32:42.3457772Z x0 = x[:, :D] 2025-05-07T20:32:42.3457990Z x1 = x[:, D:] 2025-05-07T20:32:42.3458192Z 2025-05-07T20:32:42.3458374Z if contiguous: 2025-05-07T20:32:42.3458603Z x0 = x0.contiguous() 2025-05-07T20:32:42.3458864Z x1 = x1.contiguous() 2025-05-07T20:32:42.3459107Z 2025-05-07T20:32:42.3459296Z if scale_ub is not None: 2025-05-07T20:32:42.3459569Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.3459913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.3460228Z ) 2025-05-07T20:32:42.3460412Z else: 2025-05-07T20:32:42.3460616Z scale_ub_tensor = None 2025-05-07T20:32:42.3460864Z 2025-05-07T20:32:42.3461086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.3461403Z op = silu_mul_quant 2025-05-07T20:32:42.3461650Z if compiled: 2025-05-07T20:32:42.3461892Z op = torch.compile(op) 2025-05-07T20:32:42.3462189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.3462464Z 2025-05-07T20:32:42.3462653Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.3462821Z 2025-05-07T20:32:42.3462918Z moe/activation_test.py:117: 2025-05-07T20:32:42.3463218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.3463559Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.3463833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.3464569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.3465171Z return fn(*args, **kwargs) 2025-05-07T20:32:42.3465865Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.3466604Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.3467169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.3467892Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.3468591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.3469153Z kernel = self.compile( 2025-05-07T20:32:42.3469851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.3470551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.3470956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.3471197Z 2025-05-07T20:32:42.3471409Z self = 2025-05-07T20:32:42.3472575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.3474120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3498940>} 2025-05-07T20:32:42.3475616Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.3476721Z context = 2025-05-07T20:32:42.3477026Z 2025-05-07T20:32:42.3477193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.3477731Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.3478213Z module_map=module_map) 2025-05-07T20:32:42.3478584Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.3478939Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.3479192Z E ^ 2025-05-07T20:32:42.3479678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.3480171Z 2025-05-07T20:32:42.3480618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.3481170Z 2025-05-07T20:32:42.3481277Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3481697Z self=, 2025-05-07T20:32:42.3482114Z T=128, 2025-05-07T20:32:42.3482298Z D=7168, 2025-05-07T20:32:42.3482484Z scale_ub=1200.0, 2025-05-07T20:32:42.3482696Z contiguous=True, 2025-05-07T20:32:42.3483079Z compiled=False, 2025-05-07T20:32:42.3483278Z ) 2025-05-07T20:32:42.3483595Z self = 2025-05-07T20:32:42.3484108Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.3484392Z 2025-05-07T20:32:42.3484471Z @given( 2025-05-07T20:32:42.3484690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3485008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3485318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3485646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3486106Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3486395Z ) 2025-05-07T20:32:42.3486748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3487212Z def test_silu_mul_quant( 2025-05-07T20:32:42.3487446Z self, 2025-05-07T20:32:42.3487633Z T: int, 2025-05-07T20:32:42.3487820Z D: int, 2025-05-07T20:32:42.3488035Z scale_ub: Optional[float], 2025-05-07T20:32:42.3488305Z contiguous: bool, 2025-05-07T20:32:42.3488533Z compiled: bool, 2025-05-07T20:32:42.3488757Z ) -> None: 2025-05-07T20:32:42.3488973Z torch.manual_seed(2025) 2025-05-07T20:32:42.3489208Z 2025-05-07T20:32:42.3489480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3489842Z 2025-05-07T20:32:42.3490030Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3490320Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3492507Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3494620Z 2025-05-07T20:32:42.3494732Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.3494950Z 2025-05-07T20:32:42.3495057Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3495536Z self=, 2025-05-07T20:32:42.3495950Z T=128, 2025-05-07T20:32:42.3496130Z D=5120, 2025-05-07T20:32:42.3496327Z scale_ub=1200.0, 2025-05-07T20:32:42.3496582Z contiguous=True, 2025-05-07T20:32:42.3496797Z compiled=True, 2025-05-07T20:32:42.3496988Z ) 2025-05-07T20:32:42.3497307Z self = 2025-05-07T20:32:42.3497815Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.3498096Z 2025-05-07T20:32:42.3498173Z @given( 2025-05-07T20:32:42.3498391Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3498715Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3499023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3499352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3499684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3499980Z ) 2025-05-07T20:32:42.3500328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3500787Z def test_silu_mul_quant( 2025-05-07T20:32:42.3501023Z self, 2025-05-07T20:32:42.3501210Z T: int, 2025-05-07T20:32:42.3501401Z D: int, 2025-05-07T20:32:42.3501616Z scale_ub: Optional[float], 2025-05-07T20:32:42.3501882Z contiguous: bool, 2025-05-07T20:32:42.3502118Z compiled: bool, 2025-05-07T20:32:42.3502332Z ) -> None: 2025-05-07T20:32:42.3502542Z torch.manual_seed(2025) 2025-05-07T20:32:42.3502781Z 2025-05-07T20:32:42.3503055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3503407Z 2025-05-07T20:32:42.3503592Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3503882Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3506145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3508195Z 2025-05-07T20:32:42.3508314Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.3508530Z 2025-05-07T20:32:42.3508643Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3509056Z self=, 2025-05-07T20:32:42.3509469Z T=128, 2025-05-07T20:32:42.3509652Z D=7168, 2025-05-07T20:32:42.3509913Z scale_ub=None, 2025-05-07T20:32:42.3510117Z contiguous=True, 2025-05-07T20:32:42.3510334Z compiled=True, 2025-05-07T20:32:42.3510530Z ) 2025-05-07T20:32:42.5923852Z self = 2025-05-07T20:32:42.5924613Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5925008Z 2025-05-07T20:32:42.5925112Z @given( 2025-05-07T20:32:42.5925359Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5925683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5931485Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5931839Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5932307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5932595Z ) 2025-05-07T20:32:42.5932958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5933432Z def test_silu_mul_quant( 2025-05-07T20:32:42.5933675Z self, 2025-05-07T20:32:42.5933875Z T: int, 2025-05-07T20:32:42.5934135Z D: int, 2025-05-07T20:32:42.5934350Z scale_ub: Optional[float], 2025-05-07T20:32:42.5934625Z contiguous: bool, 2025-05-07T20:32:42.5934870Z compiled: bool, 2025-05-07T20:32:42.5935109Z ) -> None: 2025-05-07T20:32:42.5935325Z torch.manual_seed(2025) 2025-05-07T20:32:42.5935575Z 2025-05-07T20:32:42.5935854Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5938096Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
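Note that the failing requests above are only 20-40 MiB while 21.77 GiB is already held by PyTorch, so the pressure is cumulative across Hypothesis examples rather than from any single allocation. The largest sampled shape is nevertheless substantial on its own; the arithmetic below (plain back-of-envelope math, not taken from the job output) works out its footprint:

    # Rough per-example footprint at the largest sampled shape.
    T, D = 16384, 7168
    bf16, fp32 = 2, 4  # bytes per element

    x_bytes = T * 2 * D * bf16      # randn input [T, 2*D] -> 448 MiB
    temp_bytes = 3 * x_bytes        # x_sign, x_clamp, x_sign * x_clamp
    ref_bytes = 3 * T * D * fp32    # x0_fp32, x1_fp32, and the silu-mul result

    print(f"x alone:        {x_bytes / 2**20:,.0f} MiB")                 # 448 MiB
    print(f"with temps:     {(x_bytes + temp_bytes) / 2**20:,.0f} MiB")  # 1,792 MiB
    print(f"reference adds: {ref_bytes / 2**20:,.0f} MiB")               # 1,344 MiB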
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5940163Z 2025-05-07T20:32:42.5940284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5940511Z 2025-05-07T20:32:42.5967572Z FAILED 2025-05-07T20:32:42.5967929Z 2025-05-07T20:32:42.5968278Z =================================== FAILURES =================================== 2025-05-07T20:32:42.5968958Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:42.5969668Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:42.5970547Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:42.5971335Z | yield 2025-05-07T20:32:42.5971931Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:42.5972671Z | self._callTestMethod(testMethod) 2025-05-07T20:32:42.5973456Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:42.5974216Z | method() 2025-05-07T20:32:42.5975331Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:42.5976390Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5977296Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:42.5978225Z | raise the_error_hypothesis_found 2025-05-07T20:32:42.5978945Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:42.5979650Z +-+---------------- 1 ---------------- 2025-05-07T20:32:42.5980044Z | Traceback (most recent call last): 2025-05-07T20:32:42.5981088Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5982205Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5985516Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5988549Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5989166Z | self=, 2025-05-07T20:32:42.5989929Z | T=2048, 2025-05-07T20:32:42.5990248Z | D=5120, # or any other generated value 2025-05-07T20:32:42.5990801Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.5991613Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.5992132Z | compiled=False, # or any other generated value 2025-05-07T20:32:42.5992720Z | ) 2025-05-07T20:32:42.5992952Z | 2025-05-07T20:32:42.5993698Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:42.5994566Z +---------------- 2 ---------------- 2025-05-07T20:32:42.5994966Z | Traceback (most recent call last): 2025-05-07T20:32:42.5996002Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5996939Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5999752Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6002640Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.6003248Z | self=, 2025-05-07T20:32:42.6003830Z | T=128, 2025-05-07T20:32:42.6004104Z | D=7168, 2025-05-07T20:32:42.6004374Z | scale_ub=None, 2025-05-07T20:32:42.6004698Z | contiguous=True, 2025-05-07T20:32:42.6005028Z | compiled=True, 2025-05-07T20:32:42.6005334Z | ) 2025-05-07T20:32:42.6005534Z | 2025-05-07T20:32:42.6006089Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.6006891Z +---------------- 3 ---------------- 2025-05-07T20:32:42.6007181Z | Traceback (most recent call last): 2025-05-07T20:32:42.6007941Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.6008770Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6011001Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
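Because Hypothesis replays many examples inside a single unittest method call, a per-test setUp cannot release memory between examples; any reclaim has to happen in the test body itself. A sketch of that mitigation (an assumption about what this suite could tolerate, not code from the repo; empty_cache() returns only cached, unreferenced blocks and cannot reclaim live tensors):

    import gc
    import torch

    def reclaim_cuda_memory() -> None:
        # Drop dangling Python references first, then hand cached CUDA blocks
        # back to the driver so the next example starts from a cleaner state.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # Would be called at the top of test_silu_mul_quant's body, once per
    # generated example.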
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6013735Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.6014377Z | self=, 2025-05-07T20:32:42.6014971Z | T=128, 2025-05-07T20:32:42.6015265Z | D=5120, 2025-05-07T20:32:42.6015566Z | scale_ub=1200.0, 2025-05-07T20:32:42.6015918Z | contiguous=True, 2025-05-07T20:32:42.6016273Z | compiled=True, 2025-05-07T20:32:42.6016745Z | ) 2025-05-07T20:32:42.6017014Z | 2025-05-07T20:32:42.6017750Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.6018410Z +---------------- 4 ---------------- 2025-05-07T20:32:42.6018707Z | Traceback (most recent call last): 2025-05-07T20:32:42.6019567Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:42.6020355Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.6021059Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:42.6021811Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6022719Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:42.6023728Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.6024602Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:42.6025647Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6026785Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:42.6027941Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6029136Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:42.6030562Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6031745Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:42.6032772Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.6033736Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:42.6034575Z | fn() 2025-05-07T20:32:42.6035563Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:42.6036537Z | self.fn.run( 2025-05-07T20:32:42.6037346Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:42.6038204Z | kernel = self.compile( 2025-05-07T20:32:42.6039092Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:42.6040148Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6041194Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.6042364Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6043112Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6043607Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.6043994Z | ^ 2025-05-07T20:32:42.6044676Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6045520Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.6046091Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:42.6046891Z | self=, 2025-05-07T20:32:42.6047589Z | T=1, # or any other generated value 2025-05-07T20:32:42.6048030Z | D=5120, # or any other generated value 2025-05-07T20:32:42.6048504Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.6049019Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.6049599Z | compiled=True, # or any other generated value 2025-05-07T20:32:42.6050025Z | ) 2025-05-07T20:32:42.6050276Z | 2025-05-07T20:32:42.6051021Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.6051912Z +------------------------------------ 2025-05-07T20:32:42.6052425Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:42.6052958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6053543Z self=, 2025-05-07T20:32:42.6054135Z T=1, 2025-05-07T20:32:42.6054398Z D=5120, 2025-05-07T20:32:42.6054666Z scale_ub=None, 2025-05-07T20:32:42.6054974Z contiguous=True, 2025-05-07T20:32:42.6055283Z compiled=True, 2025-05-07T20:32:42.6055587Z ) 2025-05-07T20:32:42.6056042Z self = 2025-05-07T20:32:42.6056791Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6057170Z 2025-05-07T20:32:42.6057276Z @given( 2025-05-07T20:32:42.6057600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6058048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6058467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6058926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6059395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6059807Z ) 2025-05-07T20:32:42.6060318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6060978Z def test_silu_mul_quant( 2025-05-07T20:32:42.6061315Z self, 2025-05-07T20:32:42.6061585Z T: int, 2025-05-07T20:32:42.6061866Z D: int, 2025-05-07T20:32:42.6062165Z scale_ub: Optional[float], 2025-05-07T20:32:42.6062562Z contiguous: bool, 2025-05-07T20:32:42.6062894Z compiled: bool, 2025-05-07T20:32:42.6063213Z ) -> None: 2025-05-07T20:32:42.6063514Z torch.manual_seed(2025) 2025-05-07T20:32:42.6063870Z 2025-05-07T20:32:42.6064403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6064907Z 2025-05-07T20:32:42.6065189Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6065610Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6066060Z x = x_sign * x_clamp 2025-05-07T20:32:42.6066414Z x0 = x[:, :D] 2025-05-07T20:32:42.6066723Z x1 = x[:, D:] 2025-05-07T20:32:42.6067013Z 2025-05-07T20:32:42.6067278Z if contiguous: 2025-05-07T20:32:42.6067594Z x0 = x0.contiguous() 
2025-05-07T20:32:42.6067953Z x1 = x1.contiguous() 2025-05-07T20:32:42.6068288Z 2025-05-07T20:32:42.6068553Z if scale_ub is not None: 2025-05-07T20:32:42.6068919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6069374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6069908Z ) 2025-05-07T20:32:42.6070165Z else: 2025-05-07T20:32:42.6070465Z scale_ub_tensor = None 2025-05-07T20:32:42.6070830Z 2025-05-07T20:32:42.6071157Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6071594Z op = silu_mul_quant 2025-05-07T20:32:42.6071950Z if compiled: 2025-05-07T20:32:42.6072290Z op = torch.compile(op) 2025-05-07T20:32:42.6072717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6073218Z 2025-05-07T20:32:42.6073493Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.6073899Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.6074329Z 2025-05-07T20:32:42.6074671Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6075152Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.6075610Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.6076050Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.6076562Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6076993Z 2025-05-07T20:32:42.6077262Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.6077540Z 2025-05-07T20:32:42.6077683Z moe/activation_test.py:126: 2025-05-07T20:32:42.6078084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6078552Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.6079009Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6080142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.6081219Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.6081965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6083229Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6084191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.6085195Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6086228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.6087263Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6088310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.6089264Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.6090159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.6090930Z fn() 2025-05-07T20:32:42.6091926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.6092788Z self.fn.run( 2025-05-07T20:32:42.6093466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6094243Z kernel = self.compile( 2025-05-07T20:32:42.6095025Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6095971Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6096533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6096886Z 2025-05-07T20:32:42.6097180Z self = 2025-05-07T20:32:42.6098792Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6100907Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd80e7040>} 2025-05-07T20:32:42.6102968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6104550Z context = 2025-05-07T20:32:42.6104967Z 2025-05-07T20:32:42.6105197Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6105935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6106687Z module_map=module_map) 2025-05-07T20:32:42.6107188Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6107686Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.6108071Z E ^ 2025-05-07T20:32:42.6108725Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6109395Z 2025-05-07T20:32:42.6113332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6114115Z 2025-05-07T20:32:42.6114260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6114843Z self=, 2025-05-07T20:32:42.6115409Z T=2048, 2025-05-07T20:32:42.6115673Z D=5120, 2025-05-07T20:32:42.6115945Z scale_ub=1200.0, 2025-05-07T20:32:42.6116256Z contiguous=True, 2025-05-07T20:32:42.6116579Z compiled=False, 2025-05-07T20:32:42.6116877Z ) 2025-05-07T20:32:42.6117330Z self = 2025-05-07T20:32:42.6118073Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6118478Z 2025-05-07T20:32:42.6118580Z @given( 2025-05-07T20:32:42.6118872Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6119277Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6119683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6120122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6120553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6120936Z ) 2025-05-07T20:32:42.6121399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6121987Z def test_silu_mul_quant( 2025-05-07T20:32:42.6122303Z self, 2025-05-07T20:32:42.6122555Z T: int, 2025-05-07T20:32:42.6122816Z D: int, 2025-05-07T20:32:42.6123093Z scale_ub: Optional[float], 2025-05-07T20:32:42.6123451Z contiguous: bool, 2025-05-07T20:32:42.6123764Z compiled: bool, 2025-05-07T20:32:42.6124172Z ) -> None: 2025-05-07T20:32:42.6124464Z torch.manual_seed(2025) 2025-05-07T20:32:42.6124791Z 2025-05-07T20:32:42.6125136Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6125593Z 2025-05-07T20:32:42.6125852Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6126235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6126644Z x = x_sign * x_clamp 2025-05-07T20:32:42.6126970Z x0 = x[:, :D] 
2025-05-07T20:32:42.6127249Z x1 = x[:, D:] 2025-05-07T20:32:42.6127522Z 2025-05-07T20:32:42.6127762Z if contiguous: 2025-05-07T20:32:42.6128059Z x0 = x0.contiguous() 2025-05-07T20:32:42.6128401Z x1 = x1.contiguous() 2025-05-07T20:32:42.6128718Z 2025-05-07T20:32:42.6128971Z if scale_ub is not None: 2025-05-07T20:32:42.6129334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6129780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6130205Z ) 2025-05-07T20:32:42.6130454Z else: 2025-05-07T20:32:42.6130729Z scale_ub_tensor = None 2025-05-07T20:32:42.6131063Z 2025-05-07T20:32:42.6131359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6131795Z op = silu_mul_quant 2025-05-07T20:32:42.6132151Z if compiled: 2025-05-07T20:32:42.6132475Z op = torch.compile(op) 2025-05-07T20:32:42.6132952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6133344Z 2025-05-07T20:32:42.6133593Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6133830Z 2025-05-07T20:32:42.6133960Z moe/activation_test.py:117: 2025-05-07T20:32:42.6134371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6134896Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6135278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6136263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6137251Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6137973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6138915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6139833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6140572Z kernel = self.compile( 2025-05-07T20:32:42.6141303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6142194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6142736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6143054Z 2025-05-07T20:32:42.6143327Z self = 2025-05-07T20:32:42.6144827Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6146820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd85aa4c0>} 2025-05-07T20:32:42.6148776Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6150385Z context = 2025-05-07T20:32:42.6150802Z 2025-05-07T20:32:42.6151036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6151880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6152540Z module_map=module_map) 2025-05-07T20:32:42.6153042Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6153518Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6153898Z E ^ 2025-05-07T20:32:42.6154578Z E ValueError("type fp8e4nv not supported in this architecture. 
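This fourth failure is unrelated to memory: Triton refuses to lower the kernel because the GPU cannot execute fp8e4nv (FP8 E4M3). The job runs on linux.g5.4xlarge.nvidia.gpu, i.e. an NVIDIA A10G at compute capability 8.6, and the message offers only fp8e4b15/fp8e5, consistent with E4M3 requiring a newer architecture. A guard along these lines would skip instead of fail on such hardware (the (8, 9) cutoff is an assumption inferred from the error, not something this log or repo states):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv path needs compute capability >= 8.9,
        # which excludes the A10G (8.6) this job runs on.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8_e4m3(), "fp8e4nv unsupported on this GPU")
    class Fp8KernelTests(unittest.TestCase):  # illustrative name, not from the repo
        ...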
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6155262Z 2025-05-07T20:32:42.6155883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6156620Z 2025-05-07T20:32:42.6156761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6157353Z self=, 2025-05-07T20:32:42.6157908Z T=2048, 2025-05-07T20:32:42.6158166Z D=5120, 2025-05-07T20:32:42.6158436Z scale_ub=1200.0, 2025-05-07T20:32:42.6158739Z contiguous=True, 2025-05-07T20:32:42.6159041Z compiled=True, 2025-05-07T20:32:42.6159298Z ) 2025-05-07T20:32:42.6159751Z self = 2025-05-07T20:32:42.6160450Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6160859Z 2025-05-07T20:32:42.6160985Z @given( 2025-05-07T20:32:42.6161356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6161780Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6162213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6162660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6163108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6183118Z ) 2025-05-07T20:32:42.6183641Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6184268Z def test_silu_mul_quant( 2025-05-07T20:32:42.6184608Z self, 2025-05-07T20:32:42.6184872Z T: int, 2025-05-07T20:32:42.6185165Z D: int, 2025-05-07T20:32:42.6185460Z scale_ub: Optional[float], 2025-05-07T20:32:42.6185868Z contiguous: bool, 2025-05-07T20:32:42.6186213Z compiled: bool, 2025-05-07T20:32:42.6186531Z ) -> None: 2025-05-07T20:32:42.6186847Z torch.manual_seed(2025) 2025-05-07T20:32:42.6187212Z 2025-05-07T20:32:42.6187611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6188094Z 2025-05-07T20:32:42.6188310Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6188616Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6188937Z x = x_sign * x_clamp 2025-05-07T20:32:42.6189198Z x0 = x[:, :D] 2025-05-07T20:32:42.6189425Z x1 = x[:, D:] 2025-05-07T20:32:42.6189637Z 2025-05-07T20:32:42.6189933Z if contiguous: 2025-05-07T20:32:42.6190175Z x0 = x0.contiguous() 2025-05-07T20:32:42.6190433Z x1 = x1.contiguous() 2025-05-07T20:32:42.6190680Z 2025-05-07T20:32:42.6190872Z if scale_ub is not None: 2025-05-07T20:32:42.6191148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6191488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6191807Z ) 2025-05-07T20:32:42.6191994Z else: 2025-05-07T20:32:42.6192212Z scale_ub_tensor = None 2025-05-07T20:32:42.6192468Z 2025-05-07T20:32:42.6192690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6193023Z op = silu_mul_quant 2025-05-07T20:32:42.6193278Z if compiled: 2025-05-07T20:32:42.6193529Z op = torch.compile(op) 2025-05-07T20:32:42.6193830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6194115Z 2025-05-07T20:32:42.6194309Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.6194945Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.6195247Z 2025-05-07T20:32:42.6195488Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6195833Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.6196130Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.6196453Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.6196821Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6197134Z 2025-05-07T20:32:42.6197334Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.6197540Z 2025-05-07T20:32:42.6197640Z moe/activation_test.py:126: 2025-05-07T20:32:42.6197939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6198284Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.6198616Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6199467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.6200274Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.6200848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6201579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6202314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.6203166Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6203971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.6204843Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6205627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.6206301Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.6206938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.6207490Z fn() 2025-05-07T20:32:42.6208017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.6208638Z self.fn.run( 2025-05-07T20:32:42.6209122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6209687Z kernel = self.compile( 2025-05-07T20:32:42.6210249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6210946Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6211357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6211599Z 2025-05-07T20:32:42.6211815Z self = 2025-05-07T20:32:42.6212974Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6214497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3dd6cde0d0>} 2025-05-07T20:32:42.6215966Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6217128Z context = 2025-05-07T20:32:42.6217429Z 2025-05-07T20:32:42.6217685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6218233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6218726Z module_map=module_map) 2025-05-07T20:32:42.6219097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6219455Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.6219733Z E ^ 2025-05-07T20:32:42.6220219Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6220706Z 2025-05-07T20:32:42.6221152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6221714Z 2025-05-07T20:32:42.6221817Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6222244Z self=, 2025-05-07T20:32:42.6222675Z T=16384, 2025-05-07T20:32:42.6222864Z D=7168, 2025-05-07T20:32:42.6223055Z scale_ub=1200.0, 2025-05-07T20:32:42.6223278Z contiguous=False, 2025-05-07T20:32:42.6223497Z compiled=False, 2025-05-07T20:32:42.6223704Z ) 2025-05-07T20:32:42.6224029Z self = 2025-05-07T20:32:42.6224548Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6224896Z 2025-05-07T20:32:42.6224974Z @given( 2025-05-07T20:32:42.6225196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6225514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6225820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6226201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6226535Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6226816Z ) 2025-05-07T20:32:42.6227179Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6227642Z def test_silu_mul_quant( 2025-05-07T20:32:42.6227882Z self, 2025-05-07T20:32:42.6228080Z T: int, 2025-05-07T20:32:42.6228274Z D: int, 2025-05-07T20:32:42.6228486Z scale_ub: Optional[float], 2025-05-07T20:32:42.6230324Z contiguous: bool, 2025-05-07T20:32:42.6230568Z compiled: bool, 2025-05-07T20:32:42.6230789Z ) -> None: 2025-05-07T20:32:42.6231005Z torch.manual_seed(2025) 2025-05-07T20:32:42.6231255Z 2025-05-07T20:32:42.6231523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6231875Z 2025-05-07T20:32:42.6232065Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6232358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6232680Z x = x_sign * x_clamp 2025-05-07T20:32:42.6232921Z x0 = x[:, :D] 2025-05-07T20:32:42.6233141Z x1 = x[:, D:] 2025-05-07T20:32:42.6233348Z 2025-05-07T20:32:42.6233531Z if contiguous: 2025-05-07T20:32:42.6233766Z x0 = x0.contiguous() 2025-05-07T20:32:42.6234023Z x1 = x1.contiguous() 2025-05-07T20:32:42.6234270Z 2025-05-07T20:32:42.6234466Z if scale_ub is not None: 2025-05-07T20:32:42.6234737Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6235079Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6235404Z ) 2025-05-07T20:32:42.6235593Z else: 2025-05-07T20:32:42.6235812Z scale_ub_tensor = None 2025-05-07T20:32:42.6236071Z 2025-05-07T20:32:42.6236296Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6236620Z op = silu_mul_quant 2025-05-07T20:32:42.6236881Z if compiled: 
2025-05-07T20:32:42.6237135Z op = torch.compile(op) 2025-05-07T20:32:42.6237434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6237718Z 2025-05-07T20:32:42.6238006Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6238180Z 2025-05-07T20:32:42.6238278Z moe/activation_test.py:117: 2025-05-07T20:32:42.6238581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6238927Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6239208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6239948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6240695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6241268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6241997Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6242706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6243281Z kernel = self.compile( 2025-05-07T20:32:42.6243846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6244544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6244954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6245195Z 2025-05-07T20:32:42.6245462Z self = 2025-05-07T20:32:42.6246656Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6248226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd6c12040>} 2025-05-07T20:32:42.6249697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6250801Z context = 2025-05-07T20:32:42.6251105Z 2025-05-07T20:32:42.6251279Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6251824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6252315Z module_map=module_map) 2025-05-07T20:32:42.6252690Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6253046Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6253318Z E ^ 2025-05-07T20:32:42.6253807Z E ValueError("type fp8e4nv not supported in this architecture. 
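For orientation, the reference path's triton_quantize_fp8_row is a row-wise FP8 quantizer; the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so it must return one dequantization scale per row. A pure-PyTorch sketch of that contract (the function name and the scale_ub clamping detail are assumptions for illustration, not FBGEMM's implementation):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs scaling into float8 e4m3 (finite max 448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed clamp semantics
        row_max = row_max.clamp(min=1e-12)              # guard against all-zero rows
        y_scale = row_max / fp8_max                     # one dequant scale per row
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

    # Round trip: y_fp8.to(torch.float32) * y_scale[:, None] approximates y.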
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6254294Z 2025-05-07T20:32:42.6254757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6255317Z 2025-05-07T20:32:42.6255423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6255854Z self=, 2025-05-07T20:32:42.6256271Z T=1, 2025-05-07T20:32:42.6256449Z D=7168, 2025-05-07T20:32:42.6256654Z scale_ub=None, 2025-05-07T20:32:42.6256911Z contiguous=True, 2025-05-07T20:32:42.6257132Z compiled=True, 2025-05-07T20:32:42.6257335Z ) 2025-05-07T20:32:42.6257659Z self = 2025-05-07T20:32:42.6258165Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6258439Z 2025-05-07T20:32:42.6258515Z @given( 2025-05-07T20:32:42.6258748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6259071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6259465Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6259808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6260146Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6260435Z ) 2025-05-07T20:32:42.6260795Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6261259Z def test_silu_mul_quant( 2025-05-07T20:32:42.6261512Z self, 2025-05-07T20:32:42.6261700Z T: int, 2025-05-07T20:32:42.6261900Z D: int, 2025-05-07T20:32:42.6262117Z scale_ub: Optional[float], 2025-05-07T20:32:42.6262392Z contiguous: bool, 2025-05-07T20:32:42.6262634Z compiled: bool, 2025-05-07T20:32:42.6262865Z ) -> None: 2025-05-07T20:32:42.6263086Z torch.manual_seed(2025) 2025-05-07T20:32:42.6263335Z 2025-05-07T20:32:42.6263608Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6263966Z 2025-05-07T20:32:42.6264162Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6264451Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6264774Z x = x_sign * x_clamp 2025-05-07T20:32:42.6265017Z x0 = x[:, :D] 2025-05-07T20:32:42.6265231Z x1 = x[:, D:] 2025-05-07T20:32:42.6265443Z 2025-05-07T20:32:42.6265628Z if contiguous: 2025-05-07T20:32:42.6265867Z x0 = x0.contiguous() 2025-05-07T20:32:42.6266177Z x1 = x1.contiguous() 2025-05-07T20:32:42.6266424Z 2025-05-07T20:32:42.6266624Z if scale_ub is not None: 2025-05-07T20:32:42.6266895Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6267238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6267625Z ) 2025-05-07T20:32:42.6267815Z else: 2025-05-07T20:32:42.6268022Z scale_ub_tensor = None 2025-05-07T20:32:42.6268284Z 2025-05-07T20:32:42.6268517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6268840Z op = silu_mul_quant 2025-05-07T20:32:42.6269092Z if compiled: 2025-05-07T20:32:42.6269334Z op = torch.compile(op) 2025-05-07T20:32:42.6269638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6270027Z 2025-05-07T20:32:42.6270215Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.6270506Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.6270807Z 2025-05-07T20:32:42.6271044Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6271382Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.6271681Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.6272005Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.6272372Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6272695Z 2025-05-07T20:32:42.6272898Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:42.6273104Z 2025-05-07T20:32:42.6273205Z moe/activation_test.py:126: 2025-05-07T20:32:42.6273508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6273858Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.6274193Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.6275033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.6275852Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.6276432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6277154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6277891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.6278754Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6279566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.6280364Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.6281150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.6281841Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.6282485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.6283303Z fn() 2025-05-07T20:32:42.6283846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.6284466Z self.fn.run( 2025-05-07T20:32:42.6284958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6285526Z kernel = self.compile( 2025-05-07T20:32:42.6286097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6286844Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6287250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6287584Z 2025-05-07T20:32:42.6287795Z self = 2025-05-07T20:32:42.6288963Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6290537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f3dd6c12dc0>} 2025-05-07T20:32:42.6292013Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6293129Z context = 2025-05-07T20:32:42.6293445Z 2025-05-07T20:32:42.6293615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6294163Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6294653Z module_map=module_map) 2025-05-07T20:32:42.6295028Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6295397Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.6295666Z E ^ 2025-05-07T20:32:42.6296154Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6296696Z 2025-05-07T20:32:42.6297144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6297701Z 2025-05-07T20:32:42.6297810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6298231Z self=, 2025-05-07T20:32:42.6298656Z T=4096, 2025-05-07T20:32:42.6298840Z D=5120, 2025-05-07T20:32:42.6299032Z scale_ub=None, 2025-05-07T20:32:42.6299239Z contiguous=False, 2025-05-07T20:32:42.6299469Z compiled=False, 2025-05-07T20:32:42.6299675Z ) 2025-05-07T20:32:42.6299998Z self = 2025-05-07T20:32:42.6300520Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6300812Z 2025-05-07T20:32:42.6300895Z @given( 2025-05-07T20:32:42.6301235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6301559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6301876Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6302206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6302545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6302837Z ) 2025-05-07T20:32:42.6303196Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6303659Z def test_silu_mul_quant( 2025-05-07T20:32:42.6303907Z self, 2025-05-07T20:32:42.6304099Z T: int, 2025-05-07T20:32:42.6304286Z D: int, 2025-05-07T20:32:42.6304493Z scale_ub: Optional[float], 2025-05-07T20:32:42.6304761Z contiguous: bool, 2025-05-07T20:32:42.6304998Z compiled: bool, 2025-05-07T20:32:42.6305222Z ) -> None: 2025-05-07T20:32:42.6305437Z torch.manual_seed(2025) 2025-05-07T20:32:42.6305677Z 2025-05-07T20:32:42.6305954Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6306313Z 2025-05-07T20:32:42.6306506Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6306803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6307121Z x = x_sign * x_clamp 2025-05-07T20:32:42.6307361Z x0 = x[:, :D] 2025-05-07T20:32:42.6307578Z x1 = x[:, D:] 2025-05-07T20:32:42.6307834Z 2025-05-07T20:32:42.6308017Z if contiguous: 2025-05-07T20:32:42.6308245Z x0 = x0.contiguous() 2025-05-07T20:32:42.6308504Z x1 = x1.contiguous() 2025-05-07T20:32:42.6308749Z 2025-05-07T20:32:42.6308933Z if scale_ub is not None: 2025-05-07T20:32:42.6309210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6309596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6310030Z ) 2025-05-07T20:32:42.6310224Z else: 2025-05-07T20:32:42.6310436Z scale_ub_tensor = None 2025-05-07T20:32:42.6310685Z 2025-05-07T20:32:42.6310915Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6311236Z op = silu_mul_quant 2025-05-07T20:32:42.6311483Z if compiled: 
2025-05-07T20:32:42.6311730Z op = torch.compile(op) 2025-05-07T20:32:42.6312030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6312306Z 2025-05-07T20:32:42.6312499Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6312672Z 2025-05-07T20:32:42.6312770Z moe/activation_test.py:117: 2025-05-07T20:32:42.6313074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6313413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6313697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6314440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6315179Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6315745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6316475Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6317180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6317742Z kernel = self.compile( 2025-05-07T20:32:42.6318311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6319007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6319415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6319665Z 2025-05-07T20:32:42.6319876Z self = 2025-05-07T20:32:42.6321134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6322653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd6c120d0>} 2025-05-07T20:32:42.6324138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6325250Z context = 2025-05-07T20:32:42.6325564Z 2025-05-07T20:32:42.6325735Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6326285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6326784Z module_map=module_map) 2025-05-07T20:32:42.6327156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6327520Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6327783Z E ^ 2025-05-07T20:32:42.6328269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6328763Z 2025-05-07T20:32:42.6329262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6329829Z 2025-05-07T20:32:42.6329935Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6330365Z self=, 2025-05-07T20:32:42.6330828Z T=4096, 2025-05-07T20:32:42.6331014Z D=7168, 2025-05-07T20:32:42.6331207Z scale_ub=None, 2025-05-07T20:32:42.6331416Z contiguous=False, 2025-05-07T20:32:42.6331638Z compiled=False, 2025-05-07T20:32:42.6331845Z ) 2025-05-07T20:32:42.6332159Z self = 2025-05-07T20:32:42.6332674Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6332968Z 2025-05-07T20:32:42.6333042Z @given( 2025-05-07T20:32:42.6333271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6333581Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6333896Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6334234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6334567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6334861Z ) 2025-05-07T20:32:42.6335221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6335678Z def test_silu_mul_quant( 2025-05-07T20:32:42.6335918Z self, 2025-05-07T20:32:42.6336109Z T: int, 2025-05-07T20:32:42.6336298Z D: int, 2025-05-07T20:32:42.6336540Z scale_ub: Optional[float], 2025-05-07T20:32:42.6336852Z contiguous: bool, 2025-05-07T20:32:42.6337092Z compiled: bool, 2025-05-07T20:32:42.6337308Z ) -> None: 2025-05-07T20:32:42.6337521Z torch.manual_seed(2025) 2025-05-07T20:32:42.6337764Z 2025-05-07T20:32:42.6338030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6338389Z 2025-05-07T20:32:42.6338581Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6338866Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6339183Z x = x_sign * x_clamp 2025-05-07T20:32:42.6339426Z x0 = x[:, :D] 2025-05-07T20:32:42.6339633Z x1 = x[:, D:] 2025-05-07T20:32:42.6346251Z 2025-05-07T20:32:42.6346473Z if contiguous: 2025-05-07T20:32:42.6346712Z x0 = x0.contiguous() 2025-05-07T20:32:42.6346979Z x1 = x1.contiguous() 2025-05-07T20:32:42.6347223Z 2025-05-07T20:32:42.6347526Z if scale_ub is not None: 2025-05-07T20:32:42.6347809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6348163Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6348482Z ) 2025-05-07T20:32:42.6348686Z else: 2025-05-07T20:32:42.6348905Z scale_ub_tensor = None 2025-05-07T20:32:42.6349168Z 2025-05-07T20:32:42.6349397Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6349833Z op = silu_mul_quant 2025-05-07T20:32:42.6350099Z if compiled: 2025-05-07T20:32:42.6350347Z op = torch.compile(op) 2025-05-07T20:32:42.6350657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6350942Z 2025-05-07T20:32:42.6351135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6351307Z 2025-05-07T20:32:42.6351409Z moe/activation_test.py:117: 2025-05-07T20:32:42.6351714Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6352057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6352347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6353090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6353840Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6354404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6355216Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6355923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6356487Z kernel = self.compile( 2025-05-07T20:32:42.6357110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6357815Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6358231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6358470Z 2025-05-07T20:32:42.6358686Z self = 2025-05-07T20:32:42.6359854Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6361358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd65f33a0>} 2025-05-07T20:32:42.6362823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6363938Z context = 2025-05-07T20:32:42.6364245Z 2025-05-07T20:32:42.6364415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6364961Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6365453Z module_map=module_map) 2025-05-07T20:32:42.6365823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6366185Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6366451Z E ^ 2025-05-07T20:32:42.6366933Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <...>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <...>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3dd6791700>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
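For context on what the failing reference path computes: ref_fn builds SiLU(x0) * x1 in fp32 and hands it to triton_quantize_fp8_row, which returns an FP8 tensor plus one inverse scale per row (the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]). A minimal PyTorch sketch of that rowwise scheme, assuming the usual e4m3 max magnitude of 448.0; this is an illustration, not FBGEMM's actual kernel:

    import torch
    from typing import Optional, Tuple

    E4M3_MAX = 448.0  # assumed max magnitude representable in float8_e4m3fn

    def rowwise_quantize_fp8_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, driven by the row's max magnitude.
        row_max = y.abs().amax(dim=1, keepdim=True).float()
        if scale_ub is not None:
            # Clamp outlier rows to the provided upper bound.
            row_max = torch.minimum(row_max, scale_ub)
        scale = E4M3_MAX / row_max.clamp(min=1e-12)  # avoid divide-by-zero rows
        y_fp8 = (y.float() * scale).to(torch.float8_e4m3fn)
        # Return the inverse scale so y ~= y_fp8.float() * y_scale[:, None].
        return y_fp8, scale.reciprocal().squeeze(1)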
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [... test body identical to the previous example ...]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <...>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f3dd6143310>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
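Every failure above is the same underlying issue: both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant cast to fp8e4nv (Triton's name for the e4m3 FP8 format), whose NVIDIA lowering requires compute capability 8.9 or newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner reports capability 8.6, where the backend only offers fp8e4b15 and fp8e5, hence the ValueError at kernel-compile time. A sketch of a capability guard such tests could use to skip on unsupported GPUs; the helper name is illustrative, not an existing FBGEMM utility:

    import unittest
    import torch

    def cuda_supports_fp8_e4m3() -> bool:
        # Triton lowers fp8e4nv only on compute capability >= (8, 9);
        # the A10G on g5 instances reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative usage on the failing test:
    # @unittest.skipUnless(cuda_supports_fp8_e4m3(), "FP8 e4m3 needs SM 8.9+ (Ada/Hopper)")
    # def test_silu_mul_quant(self, ...): ...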
[... the next six generated examples fail identically; only the Hypothesis parameters and the failing call site change:

Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=True, compiled=False -> fn() at moe/activation_test.py:117, _fbgemm_silu_mul_quant
Trying example: T=1,     D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row
Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True, compiled=True  -> ref_fn() at moe/activation_test.py:126, _kernel_quantize_fp8_row

each ending with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
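The failure is independent of FBGEMM's kernels: any Triton cast to fp8e4nv should reproduce the same CompilationError on this GPU. A minimal sketch, assuming a recent Triton that exposes tl.float8e4nv and a PyTorch with the float8_e4m3fn dtype:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_e4m3(X, Y, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(X + offs, mask=mask)
        # On SM 8.6 this cast fails at compile time with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_e4m3[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)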
torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
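Both failures have the same root cause: fp8e4nv is Triton's name for the torch.float8_e4m3fn dtype, and
Triton's NVIDIA backend only generates it for GPUs of compute capability 8.9 or newer (Ada/Hopper); the
A10G (SM 8.6) found on g5 runners predates that, so every FP8 kernel in this test fails to compile. A
minimal guard along the following lines (a sketch with a hypothetical helper name, not code from
activation_test.py) would skip these examples on unsupported GPUs instead of erroring out:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv (torch.float8_e4m3fn) only for compute
        # capability >= 8.9 (Ada / Hopper); the A10G above is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...

The same predicate could instead feed hypothesis's assume() inside the test body if only the FP8-dependent
examples should be skipped.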
Hypothesis went on to try eleven more examples, and every one failed with the same
triton.compiler.errors.CompilationError raised from
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100:
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").
Examples failing at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) hit the error while compiling
_fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80); examples that reached the
reference path `y_fp8_ref, y_scale_ref = ref_fn()` (moe/activation_test.py:126) hit it while compiling
_kernel_quantize_fp8_row via triton_quantize_fp8_row
(fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370); a plain-PyTorch sketch of that row-wise
quantization follows the list below. With the identical per-example source listings and tracebacks elided,
the examples tried were:

Trying example: T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=True   -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=128,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=True   -> fails in ref_fn (_kernel_quantize_fp8_row)
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True   -> fails in fn (_fbgemm_silu_mul_quant)
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False  -> fails in fn (_fbgemm_silu_mul_quant)

The final example and the start of its traceback, as captured, follow the sketch below.
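For reference, the row-wise FP8 quantization that ref_fn delegates to triton_quantize_fp8_row can be
sketched in plain PyTorch. This is an illustrative reconstruction inferred from the test's dequantization
step (`y = y_fp8.to(torch.float32) * y_scale[:, None]`); the exact scale_ub semantics and the choice of
float8_e4m3fn are assumptions, not the kernel source:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row maximum magnitude, optionally clamped from above by
        # scale_ub (assumed semantics of the scale_ub argument).
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Guard against all-zero rows before dividing.
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        # Clamp into the representable range, then cast to FP8.
        y_fp8 = torch.clamp(
            y.to(torch.float32) / y_scale, -FP8_MAX, FP8_MAX
        ).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y ~= y_fp8.float() * y_scale[:, None]
        return y_fp8, y_scale.squeeze(-1)

Because the FP8 cast here is an ordinary PyTorch dtype conversion rather than Triton codegen, a sketch
like this also runs on the SM 8.6 runner where the Triton kernel cannot compile.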
2025-05-07T20:32:42.6713340Z op = torch.compile(op) 2025-05-07T20:32:42.6713453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6713529Z 2025-05-07T20:32:42.6713728Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6713737Z 2025-05-07T20:32:42.6713840Z moe/activation_test.py:117: 2025-05-07T20:32:42.6713974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6714082Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6714182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6714576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6714675Z return fn(*args, **kwargs) 2025-05-07T20:32:42.6715217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6715317Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6715702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6715935Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6716306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6716400Z kernel = self.compile( 2025-05-07T20:32:42.6716810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6716993Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6717169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6717174Z 2025-05-07T20:32:42.6717390Z self = 2025-05-07T20:32:42.6718242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6718836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4636f70>} 2025-05-07T20:32:42.6719651Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6719848Z context = 2025-05-07T20:32:42.6719855Z 2025-05-07T20:32:42.6720026Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6720302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6720412Z module_map=module_map) 2025-05-07T20:32:42.6720588Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6720689Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6720774Z E ^ 2025-05-07T20:32:42.6721163Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6721167Z 2025-05-07T20:32:42.6721614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6721619Z 2025-05-07T20:32:42.6721727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6721965Z self=, 2025-05-07T20:32:42.6722052Z T=1, 2025-05-07T20:32:42.6722132Z D=5120, 2025-05-07T20:32:42.6722217Z scale_ub=1200.0, 2025-05-07T20:32:42.6722306Z contiguous=False, 2025-05-07T20:32:42.6722390Z compiled=False, 2025-05-07T20:32:42.6722469Z ) 2025-05-07T20:32:42.6722704Z self = 2025-05-07T20:32:42.6722879Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6722883Z 2025-05-07T20:32:42.6723046Z @given( 2025-05-07T20:32:42.6723168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6723268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6723385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6723501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6723613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6723689Z ) 2025-05-07T20:32:42.6723947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6724038Z def test_silu_mul_quant( 2025-05-07T20:32:42.6724121Z self, 2025-05-07T20:32:42.6724199Z T: int, 2025-05-07T20:32:42.6724278Z D: int, 2025-05-07T20:32:42.6724383Z scale_ub: Optional[float], 2025-05-07T20:32:42.6724476Z contiguous: bool, 2025-05-07T20:32:42.6724563Z compiled: bool, 2025-05-07T20:32:42.6724644Z ) -> None: 2025-05-07T20:32:42.6724745Z torch.manual_seed(2025) 2025-05-07T20:32:42.6724827Z 2025-05-07T20:32:42.6725001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6725079Z 2025-05-07T20:32:42.6725176Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6725303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6725396Z x = x_sign * x_clamp 2025-05-07T20:32:42.6725482Z x0 = x[:, :D] 2025-05-07T20:32:42.6725606Z x1 = x[:, D:] 2025-05-07T20:32:42.6725679Z 2025-05-07T20:32:42.6725764Z if contiguous: 2025-05-07T20:32:42.6725854Z x0 = x0.contiguous() 2025-05-07T20:32:42.6725943Z x1 = x1.contiguous() 2025-05-07T20:32:42.6726019Z 2025-05-07T20:32:42.6726108Z if scale_ub is not None: 2025-05-07T20:32:42.6726254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6726391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6726469Z ) 2025-05-07T20:32:42.6726553Z else: 2025-05-07T20:32:42.6726644Z scale_ub_tensor = None 2025-05-07T20:32:42.6726716Z 2025-05-07T20:32:42.6726848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6726938Z op = silu_mul_quant 2025-05-07T20:32:42.6727021Z if compiled: 2025-05-07T20:32:42.6727127Z op = torch.compile(op) 2025-05-07T20:32:42.6727234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6727310Z 2025-05-07T20:32:42.6727404Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6727408Z 2025-05-07T20:32:42.6727509Z moe/activation_test.py:117: 2025-05-07T20:32:42.6727647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6727749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6727854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6728405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6728506Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6728891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6729135Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6729500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6729601Z kernel = self.compile( 2025-05-07T20:32:42.6730011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6730190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6730337Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6730342Z 2025-05-07T20:32:42.6730555Z self = 2025-05-07T20:32:42.6731493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6732042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd46d1700>} 2025-05-07T20:32:42.6732856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6733055Z context = 2025-05-07T20:32:42.6733061Z 2025-05-07T20:32:42.6733231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6733513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6733620Z module_map=module_map) 2025-05-07T20:32:42.6733782Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6733883Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6733962Z E ^ 2025-05-07T20:32:42.6734347Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6734396Z 2025-05-07T20:32:42.6734840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6734844Z 2025-05-07T20:32:42.6734945Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6735181Z self=, 2025-05-07T20:32:42.6735302Z T=16384, 2025-05-07T20:32:42.6735380Z D=5120, 2025-05-07T20:32:42.6735471Z scale_ub=1200.0, 2025-05-07T20:32:42.6735560Z contiguous=False, 2025-05-07T20:32:42.6735642Z compiled=True, 2025-05-07T20:32:42.6735716Z ) 2025-05-07T20:32:42.6735943Z self = 2025-05-07T20:32:42.6736131Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6736136Z 2025-05-07T20:32:42.6736213Z @given( 2025-05-07T20:32:42.6736332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6736435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6736554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6736672Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6736790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6736867Z ) 2025-05-07T20:32:42.6737134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6737231Z def test_silu_mul_quant( 2025-05-07T20:32:42.6737310Z self, 2025-05-07T20:32:42.6737395Z T: int, 2025-05-07T20:32:42.6737473Z D: int, 2025-05-07T20:32:42.6737575Z scale_ub: Optional[float], 2025-05-07T20:32:42.6737671Z contiguous: bool, 2025-05-07T20:32:42.6737758Z compiled: bool, 2025-05-07T20:32:42.6737837Z ) -> None: 2025-05-07T20:32:42.6737936Z torch.manual_seed(2025) 2025-05-07T20:32:42.6738012Z 2025-05-07T20:32:42.6738186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6738269Z 2025-05-07T20:32:42.6738362Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6738488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6738581Z x = x_sign * x_clamp 2025-05-07T20:32:42.6738663Z x0 = x[:, :D] 2025-05-07T20:32:42.6738752Z x1 = x[:, D:] 2025-05-07T20:32:42.6738827Z 2025-05-07T20:32:42.6738911Z if contiguous: 2025-05-07T20:32:42.6739007Z x0 = x0.contiguous() 2025-05-07T20:32:42.6739179Z x1 = x1.contiguous() 2025-05-07T20:32:42.6739256Z 2025-05-07T20:32:42.6739350Z if scale_ub is not None: 2025-05-07T20:32:42.6739455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6739593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6739674Z ) 2025-05-07T20:32:42.6739754Z else: 2025-05-07T20:32:42.6739847Z scale_ub_tensor = None 2025-05-07T20:32:42.6739924Z 2025-05-07T20:32:42.6740053Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6740145Z op = silu_mul_quant 2025-05-07T20:32:42.6740228Z if compiled: 2025-05-07T20:32:42.6740326Z op = torch.compile(op) 2025-05-07T20:32:42.6740436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6740513Z 2025-05-07T20:32:42.6740602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6740606Z 2025-05-07T20:32:42.6740705Z moe/activation_test.py:117: 2025-05-07T20:32:42.6740842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6740942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6741042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6741434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6741527Z return fn(*args, **kwargs) 
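Every failure in this run bottoms out in the same Triton ValueError: fp8e4nv is Triton's name for the e4m3 FP8 format (torch.float8_e4m3fn), which Triton's NVIDIA backend only accepts on GPUs of compute capability sm_89 or newer, while this job's linux.g5.4xlarge runner carries an A10G reporting sm_86. A minimal guard that would skip the test on such hardware, sketched with unittest (the helper and class names here are hypothetical, not FBGEMM's actual test scaffolding):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with compute
    # capability >= (8, 9); the A10G on a g5 runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class SiluMulQuantFP8Test(unittest.TestCase):  # hypothetical name
    ...
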
2025-05-07T20:32:42.6742131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6742229Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6742616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6742887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6743250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6743350Z kernel = self.compile( 2025-05-07T20:32:42.6743759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6743941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6744070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6744077Z 2025-05-07T20:32:42.6744288Z self = 2025-05-07T20:32:42.6745139Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6745691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4addd30>} 2025-05-07T20:32:42.6746511Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6746709Z context = 2025-05-07T20:32:42.6746714Z 2025-05-07T20:32:42.6746884Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6747163Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6747267Z module_map=module_map) 2025-05-07T20:32:42.6747430Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6747528Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6747606Z E ^ 2025-05-07T20:32:42.6747990Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6748097Z 2025-05-07T20:32:42.6748545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6748549Z 2025-05-07T20:32:42.6748654Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6748886Z self=, 2025-05-07T20:32:42.6748967Z T=2048, 2025-05-07T20:32:42.6749056Z D=7168, 2025-05-07T20:32:42.6749143Z scale_ub=1200.0, 2025-05-07T20:32:42.6749233Z contiguous=False, 2025-05-07T20:32:42.6749323Z compiled=True, 2025-05-07T20:32:42.6749398Z ) 2025-05-07T20:32:42.6749628Z self = 2025-05-07T20:32:42.6749912Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6749919Z 2025-05-07T20:32:42.6750000Z @given( 2025-05-07T20:32:42.6750127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6750234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6750353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6750481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6750596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6750675Z ) 2025-05-07T20:32:42.6750939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6751080Z def test_silu_mul_quant( 2025-05-07T20:32:42.6751162Z self, 2025-05-07T20:32:42.6751240Z T: int, 2025-05-07T20:32:42.6751318Z D: int, 2025-05-07T20:32:42.6751423Z scale_ub: Optional[float], 2025-05-07T20:32:42.6751516Z contiguous: bool, 2025-05-07T20:32:42.6751606Z compiled: bool, 2025-05-07T20:32:42.6751733Z ) -> None: 2025-05-07T20:32:42.6752032Z torch.manual_seed(2025) 2025-05-07T20:32:42.6752108Z 2025-05-07T20:32:42.6752292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6752366Z 2025-05-07T20:32:42.6752458Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6752590Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6752682Z x = x_sign * x_clamp 2025-05-07T20:32:42.6752762Z x0 = x[:, :D] 2025-05-07T20:32:42.6752848Z x1 = x[:, D:] 2025-05-07T20:32:42.6752922Z 2025-05-07T20:32:42.6753011Z if contiguous: 2025-05-07T20:32:42.6753106Z x0 = x0.contiguous() 2025-05-07T20:32:42.6753195Z x1 = x1.contiguous() 2025-05-07T20:32:42.6753272Z 2025-05-07T20:32:42.6753367Z if scale_ub is not None: 2025-05-07T20:32:42.6753476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6753618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6753703Z ) 2025-05-07T20:32:42.6753782Z else: 2025-05-07T20:32:42.6753878Z scale_ub_tensor = None 2025-05-07T20:32:42.6753950Z 2025-05-07T20:32:42.6754088Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6754183Z op = silu_mul_quant 2025-05-07T20:32:42.6754269Z if compiled: 2025-05-07T20:32:42.6754374Z op = torch.compile(op) 2025-05-07T20:32:42.6754481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6754554Z 2025-05-07T20:32:42.6754649Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6754656Z 2025-05-07T20:32:42.6754753Z moe/activation_test.py:117: 2025-05-07T20:32:42.6754887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6754994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6755097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6755492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6755592Z return fn(*args, **kwargs) 
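Note how the compiled=True examples route through torch/_dynamo/eval_frame.py before reaching the kernel: torch.compile returns a wrapper immediately and defers all tracing and backend compilation to the first call, which is why the Triton error surfaces inside fn() rather than at the op = torch.compile(op) line. A small illustration of that lazy behavior:

import torch


def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.silu(x0) * x1


fn = torch.compile(silu_mul)  # returns a wrapper at once; nothing compiled yet
x = torch.randn(4, 8)
out = fn(x, x)  # first call traces and compiles, so backend errors appear here
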
2025-05-07T20:32:42.6756218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6756322Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6756707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6756944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6757320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6757417Z kernel = self.compile( 2025-05-07T20:32:42.6761734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6761942Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6762085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6762090Z 2025-05-07T20:32:42.6762315Z self = 2025-05-07T20:32:42.6763179Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6763741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4059b80>} 2025-05-07T20:32:42.6764639Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6764887Z context = 2025-05-07T20:32:42.6764892Z 2025-05-07T20:32:42.6765066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6765354Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6765469Z module_map=module_map) 2025-05-07T20:32:42.6765641Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6765750Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6765831Z E ^ 2025-05-07T20:32:42.6766220Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6766227Z 2025-05-07T20:32:42.6766687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6766691Z 2025-05-07T20:32:42.6766799Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6767041Z self=, 2025-05-07T20:32:42.6767123Z T=1, 2025-05-07T20:32:42.6767206Z D=5120, 2025-05-07T20:32:42.6767300Z scale_ub=None, 2025-05-07T20:32:42.6767395Z contiguous=False, 2025-05-07T20:32:42.6767483Z compiled=False, 2025-05-07T20:32:42.6767565Z ) 2025-05-07T20:32:42.6767797Z self = 2025-05-07T20:32:42.6767976Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6767983Z 2025-05-07T20:32:42.6768066Z @given( 2025-05-07T20:32:42.6768194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6768307Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6768427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6768550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6768677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6768759Z ) 2025-05-07T20:32:42.6769023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6769124Z def test_silu_mul_quant( 2025-05-07T20:32:42.6769290Z self, 2025-05-07T20:32:42.6769373Z T: int, 2025-05-07T20:32:42.6769457Z D: int, 2025-05-07T20:32:42.6769560Z scale_ub: Optional[float], 2025-05-07T20:32:42.6769656Z contiguous: bool, 2025-05-07T20:32:42.6769747Z compiled: bool, 2025-05-07T20:32:42.6769831Z ) -> None: 2025-05-07T20:32:42.6769934Z torch.manual_seed(2025) 2025-05-07T20:32:42.6770015Z 2025-05-07T20:32:42.6770193Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6770275Z 2025-05-07T20:32:42.6770372Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6770503Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6770601Z x = x_sign * x_clamp 2025-05-07T20:32:42.6770692Z x0 = x[:, :D] 2025-05-07T20:32:42.6770775Z x1 = x[:, D:] 2025-05-07T20:32:42.6770855Z 2025-05-07T20:32:42.6770949Z if contiguous: 2025-05-07T20:32:42.6771045Z x0 = x0.contiguous() 2025-05-07T20:32:42.6771147Z x1 = x1.contiguous() 2025-05-07T20:32:42.6771228Z 2025-05-07T20:32:42.6771327Z if scale_ub is not None: 2025-05-07T20:32:42.6771440Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6771587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6771668Z ) 2025-05-07T20:32:42.6771748Z else: 2025-05-07T20:32:42.6771889Z scale_ub_tensor = None 2025-05-07T20:32:42.6771972Z 2025-05-07T20:32:42.6772108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6772203Z op = silu_mul_quant 2025-05-07T20:32:42.6772296Z if compiled: 2025-05-07T20:32:42.6772401Z op = torch.compile(op) 2025-05-07T20:32:42.6772553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6772630Z 2025-05-07T20:32:42.6772725Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6772730Z 2025-05-07T20:32:42.6772842Z moe/activation_test.py:117: 2025-05-07T20:32:42.6772982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6773090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6773197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6773744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6773852Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6774247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6774489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6774865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6774965Z kernel = self.compile( 2025-05-07T20:32:42.6775385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6775576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6775713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6775718Z 2025-05-07T20:32:42.6775935Z self = 2025-05-07T20:32:42.6776797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6777355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd45925e0>} 2025-05-07T20:32:42.6778263Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6778466Z context = 2025-05-07T20:32:42.6778470Z 2025-05-07T20:32:42.6778650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6778933Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6779049Z module_map=module_map) 2025-05-07T20:32:42.6779222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6779325Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6779406Z E ^ 2025-05-07T20:32:42.6779800Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6779807Z 2025-05-07T20:32:42.6780266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6780270Z 2025-05-07T20:32:42.6780384Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6780627Z self=, 2025-05-07T20:32:42.6780711Z T=4096, 2025-05-07T20:32:42.6780795Z D=7168, 2025-05-07T20:32:42.6780884Z scale_ub=1200.0, 2025-05-07T20:32:42.6780975Z contiguous=False, 2025-05-07T20:32:42.6781108Z compiled=False, 2025-05-07T20:32:42.6781183Z ) 2025-05-07T20:32:42.6781417Z self = 2025-05-07T20:32:42.6781605Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6781609Z 2025-05-07T20:32:42.6781691Z @given( 2025-05-07T20:32:42.6781888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6781991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6782114Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6782245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6782364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6782448Z ) 2025-05-07T20:32:42.6782715Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6783020Z def test_silu_mul_quant( 2025-05-07T20:32:42.6783140Z self, 2025-05-07T20:32:42.6783238Z T: int, 2025-05-07T20:32:42.6783323Z D: int, 2025-05-07T20:32:42.6783429Z scale_ub: Optional[float], 2025-05-07T20:32:42.6783521Z contiguous: bool, 2025-05-07T20:32:42.6783609Z compiled: bool, 2025-05-07T20:32:42.6783690Z ) -> None: 2025-05-07T20:32:42.6783787Z torch.manual_seed(2025) 2025-05-07T20:32:42.6783860Z 2025-05-07T20:32:42.6784043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6784121Z 2025-05-07T20:32:42.6784216Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6784354Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6784446Z x = x_sign * x_clamp 2025-05-07T20:32:42.6784533Z x0 = x[:, :D] 2025-05-07T20:32:42.6784615Z x1 = x[:, D:] 2025-05-07T20:32:42.6784689Z 2025-05-07T20:32:42.6784779Z if contiguous: 2025-05-07T20:32:42.6784874Z x0 = x0.contiguous() 2025-05-07T20:32:42.6784967Z x1 = x1.contiguous() 2025-05-07T20:32:42.6785049Z 2025-05-07T20:32:42.6785143Z if scale_ub is not None: 2025-05-07T20:32:42.6785250Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6785391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6785469Z ) 2025-05-07T20:32:42.6785546Z else: 2025-05-07T20:32:42.6785646Z scale_ub_tensor = None 2025-05-07T20:32:42.6785725Z 2025-05-07T20:32:42.6785862Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6785955Z op = silu_mul_quant 2025-05-07T20:32:42.6786191Z if compiled: 2025-05-07T20:32:42.6786298Z op = torch.compile(op) 2025-05-07T20:32:42.6786407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6786481Z 2025-05-07T20:32:42.6786578Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6786582Z 2025-05-07T20:32:42.6786685Z moe/activation_test.py:117: 2025-05-07T20:32:42.6786830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6786948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6787057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6787679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6787783Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6788216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6788479Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6788888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6788984Z kernel = self.compile( 2025-05-07T20:32:42.6789452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6789648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6789965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6789969Z 2025-05-07T20:32:42.6790185Z self = 2025-05-07T20:32:42.6791035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6791660Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd45928b0>} 2025-05-07T20:32:42.6792476Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6792685Z context = 2025-05-07T20:32:42.6792689Z 2025-05-07T20:32:42.6792862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6793144Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6793258Z module_map=module_map) 2025-05-07T20:32:42.6793423Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6793527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6793609Z E ^ 2025-05-07T20:32:42.6793996Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6794001Z 2025-05-07T20:32:42.6794454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6794458Z 2025-05-07T20:32:42.6794565Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6794805Z self=, 2025-05-07T20:32:42.6794884Z T=16384, 2025-05-07T20:32:42.6794964Z D=7168, 2025-05-07T20:32:42.6795052Z scale_ub=None, 2025-05-07T20:32:42.6795141Z contiguous=True, 2025-05-07T20:32:42.6795228Z compiled=True, 2025-05-07T20:32:42.6795308Z ) 2025-05-07T20:32:42.6795536Z self = 2025-05-07T20:32:42.6795718Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6795806Z 2025-05-07T20:32:42.6795888Z @given( 2025-05-07T20:32:42.6796011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6796111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6796224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6796341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6796456Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6796533Z ) 2025-05-07T20:32:42.6796790Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6796889Z def test_silu_mul_quant( 2025-05-07T20:32:42.6796968Z self, 2025-05-07T20:32:42.6797048Z T: int, 2025-05-07T20:32:42.6797137Z D: int, 2025-05-07T20:32:42.6797238Z scale_ub: Optional[float], 2025-05-07T20:32:42.6797330Z contiguous: bool, 2025-05-07T20:32:42.6797419Z compiled: bool, 2025-05-07T20:32:42.6797505Z ) -> None: 2025-05-07T20:32:42.6797612Z torch.manual_seed(2025) 2025-05-07T20:32:42.6797690Z 2025-05-07T20:32:42.6797865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6797942Z 2025-05-07T20:32:42.6798042Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6798169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6798263Z x = x_sign * x_clamp 2025-05-07T20:32:42.6798391Z x0 = x[:, :D] 2025-05-07T20:32:42.6798474Z x1 = x[:, D:] 2025-05-07T20:32:42.6798547Z 2025-05-07T20:32:42.6798634Z if contiguous: 2025-05-07T20:32:42.6798728Z x0 = x0.contiguous() 2025-05-07T20:32:42.6798824Z x1 = x1.contiguous() 2025-05-07T20:32:42.6798901Z 2025-05-07T20:32:42.6799037Z if scale_ub is not None: 2025-05-07T20:32:42.6799141Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6799276Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6799360Z ) 2025-05-07T20:32:42.6799438Z else: 2025-05-07T20:32:42.6799534Z scale_ub_tensor = None 2025-05-07T20:32:42.6799614Z 2025-05-07T20:32:42.6799748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6799839Z op = silu_mul_quant 2025-05-07T20:32:42.6799931Z if compiled: 2025-05-07T20:32:42.6800033Z op = torch.compile(op) 2025-05-07T20:32:42.6800144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6800225Z 2025-05-07T20:32:42.6800318Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6800323Z 2025-05-07T20:32:42.6800427Z moe/activation_test.py:117: 2025-05-07T20:32:42.6800564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6800670Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6800777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6801175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6801272Z return fn(*args, **kwargs) 
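For orientation, silu_mul_quant fuses a SiLU-gated elementwise multiply with quantization to FP8, returning the quantized tensor together with its scale, with scale_ub capping the dynamic range used for scaling. A rough eager-mode reference of that math (the row-wise scaling scheme below is an assumption for illustration, not the FBGEMM kernel's documented contract):

import torch

FP8_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn


def silu_mul_quant_ref(x0, x1, scale_ub=None):
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)  # cap the dynamic range
    scale = amax / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale
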
2025-05-07T20:32:42.6801812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6801912Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6802301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6802540Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6802905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6803003Z kernel = self.compile( 2025-05-07T20:32:42.6803419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6803599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6803817Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6803822Z 2025-05-07T20:32:42.6804034Z self = 2025-05-07T20:32:42.6804883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6805433Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd42abc10>} 2025-05-07T20:32:42.6806247Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6806452Z context = 2025-05-07T20:32:42.6806456Z 2025-05-07T20:32:42.6806626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6806903Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6807011Z module_map=module_map) 2025-05-07T20:32:42.6807180Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6807321Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6807398Z E ^ 2025-05-07T20:32:42.6807784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6807789Z 2025-05-07T20:32:42.6808238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6808284Z 2025-05-07T20:32:42.6808392Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6808629Z self=, 2025-05-07T20:32:42.6808710Z T=4096, 2025-05-07T20:32:42.6808793Z D=5120, 2025-05-07T20:32:42.6808877Z scale_ub=None, 2025-05-07T20:32:42.6808965Z contiguous=False, 2025-05-07T20:32:42.6809053Z compiled=True, 2025-05-07T20:32:42.6809131Z ) 2025-05-07T20:32:42.6809363Z self = 2025-05-07T20:32:42.6809550Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6809554Z 2025-05-07T20:32:42.6809633Z @given( 2025-05-07T20:32:42.6809756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6809858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6809979Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6810101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6810217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6810299Z ) 2025-05-07T20:32:42.6810566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6810661Z def test_silu_mul_quant( 2025-05-07T20:32:42.6810742Z self, 2025-05-07T20:32:42.6810827Z T: int, 2025-05-07T20:32:42.6810911Z D: int, 2025-05-07T20:32:42.6811014Z scale_ub: Optional[float], 2025-05-07T20:32:42.6811108Z contiguous: bool, 2025-05-07T20:32:42.6811201Z compiled: bool, 2025-05-07T20:32:42.6811287Z ) -> None: 2025-05-07T20:32:42.6811383Z torch.manual_seed(2025) 2025-05-07T20:32:42.6811462Z 2025-05-07T20:32:42.6811644Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6811724Z 2025-05-07T20:32:42.6811822Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6811955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6812046Z x = x_sign * x_clamp 2025-05-07T20:32:42.6812127Z x0 = x[:, :D] 2025-05-07T20:32:42.6812297Z x1 = x[:, D:] 2025-05-07T20:32:42.6812375Z 2025-05-07T20:32:42.6812458Z if contiguous: 2025-05-07T20:32:42.6812551Z x0 = x0.contiguous() 2025-05-07T20:32:42.6812644Z x1 = x1.contiguous() 2025-05-07T20:32:42.6812722Z 2025-05-07T20:32:42.6812821Z if scale_ub is not None: 2025-05-07T20:32:42.6812934Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6813076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6813151Z ) 2025-05-07T20:32:42.6813229Z else: 2025-05-07T20:32:42.6813329Z scale_ub_tensor = None 2025-05-07T20:32:42.6813406Z 2025-05-07T20:32:42.6813541Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6813641Z op = silu_mul_quant 2025-05-07T20:32:42.6813728Z if compiled: 2025-05-07T20:32:42.6813829Z op = torch.compile(op) 2025-05-07T20:32:42.6813946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6814022Z 2025-05-07T20:32:42.6814117Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6814128Z 2025-05-07T20:32:42.6814226Z moe/activation_test.py:117: 2025-05-07T20:32:42.6814363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6814467Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6814569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6815036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6815136Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6815673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6815809Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6816194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6816432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6816798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6816890Z kernel = self.compile( 2025-05-07T20:32:42.6817299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6817482Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6817614Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6817619Z 2025-05-07T20:32:42.6817835Z self = 2025-05-07T20:32:42.6818687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6819236Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd427d820>} 2025-05-07T20:32:42.6820055Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6820259Z context = 2025-05-07T20:32:42.6820264Z 2025-05-07T20:32:42.6820441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6820719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6820831Z module_map=module_map) 2025-05-07T20:32:42.6820999Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6821177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6821260Z E ^ 2025-05-07T20:32:42.6821640Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6821645Z 2025-05-07T20:32:42.6822090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6822097Z 2025-05-07T20:32:42.6822208Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6822439Z self=, 2025-05-07T20:32:42.6822526Z T=4096, 2025-05-07T20:32:42.6822609Z D=5120, 2025-05-07T20:32:42.6822696Z scale_ub=1200.0, 2025-05-07T20:32:42.6822787Z contiguous=False, 2025-05-07T20:32:42.6822876Z compiled=False, 2025-05-07T20:32:42.6822953Z ) 2025-05-07T20:32:42.6823184Z self = 2025-05-07T20:32:42.6823373Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6823378Z 2025-05-07T20:32:42.6823459Z @given( 2025-05-07T20:32:42.6823587Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6823688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6823811Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6823931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6824091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6824171Z ) 2025-05-07T20:32:42.6824429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6824526Z def test_silu_mul_quant( 2025-05-07T20:32:42.6824618Z self, 2025-05-07T20:32:42.6824735Z T: int, 2025-05-07T20:32:42.6824814Z D: int, 2025-05-07T20:32:42.6824921Z scale_ub: Optional[float], 2025-05-07T20:32:42.6825013Z contiguous: bool, 2025-05-07T20:32:42.6825105Z compiled: bool, 2025-05-07T20:32:42.6825193Z ) -> None: 2025-05-07T20:32:42.6825289Z torch.manual_seed(2025) 2025-05-07T20:32:42.6825364Z 2025-05-07T20:32:42.6825552Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6825632Z 2025-05-07T20:32:42.6825728Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6825859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6825954Z x = x_sign * x_clamp 2025-05-07T20:32:42.6826039Z x0 = x[:, :D] 2025-05-07T20:32:42.6826122Z x1 = x[:, D:] 2025-05-07T20:32:42.6826197Z 2025-05-07T20:32:42.6826281Z if contiguous: 2025-05-07T20:32:42.6826381Z x0 = x0.contiguous() 2025-05-07T20:32:42.6826473Z x1 = x1.contiguous() 2025-05-07T20:32:42.6826557Z 2025-05-07T20:32:42.6826654Z if scale_ub is not None: 2025-05-07T20:32:42.6826763Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6826907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6826986Z ) 2025-05-07T20:32:42.6827065Z else: 2025-05-07T20:32:42.6827165Z scale_ub_tensor = None 2025-05-07T20:32:42.6827243Z 2025-05-07T20:32:42.6827374Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6827468Z op = silu_mul_quant 2025-05-07T20:32:42.6827554Z if compiled: 2025-05-07T20:32:42.6827658Z op = torch.compile(op) 2025-05-07T20:32:42.6827769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6827846Z 2025-05-07T20:32:42.6827940Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6827946Z 2025-05-07T20:32:42.6828047Z moe/activation_test.py:117: 2025-05-07T20:32:42.6828181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6828290Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6828393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6829016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6829117Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6829499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6829802Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6830173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6830267Z kernel = self.compile( 2025-05-07T20:32:42.6830680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6830860Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6830990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6830994Z 2025-05-07T20:32:42.6831216Z self = 2025-05-07T20:32:42.6832063Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6832614Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4119280>} 2025-05-07T20:32:42.6833472Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6833716Z context = 2025-05-07T20:32:42.6833721Z 2025-05-07T20:32:42.6833896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6834175Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6834285Z module_map=module_map) 2025-05-07T20:32:42.6834447Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6834544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6834626Z E ^ 2025-05-07T20:32:42.6835016Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6835021Z 2025-05-07T20:32:42.6835472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6835476Z 2025-05-07T20:32:42.6835583Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6835820Z self=, 2025-05-07T20:32:42.6835906Z T=4096, 2025-05-07T20:32:42.6835989Z D=5120, 2025-05-07T20:32:42.6836076Z scale_ub=1200.0, 2025-05-07T20:32:42.6836167Z contiguous=False, 2025-05-07T20:32:42.6836254Z compiled=True, 2025-05-07T20:32:42.6836335Z ) 2025-05-07T20:32:42.6836569Z self = 2025-05-07T20:32:42.6836754Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6836760Z 2025-05-07T20:32:42.6836843Z @given( 2025-05-07T20:32:42.6836964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6837065Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6837187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6837309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6837427Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6837507Z ) 2025-05-07T20:32:42.6837767Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6837951Z def test_silu_mul_quant( 2025-05-07T20:32:42.6838027Z self, 2025-05-07T20:32:42.6838107Z T: int, 2025-05-07T20:32:42.6838188Z D: int, 2025-05-07T20:32:42.6838290Z scale_ub: Optional[float], 2025-05-07T20:32:42.6838380Z contiguous: bool, 2025-05-07T20:32:42.6838473Z compiled: bool, 2025-05-07T20:32:42.6838555Z ) -> None: 2025-05-07T20:32:42.6838654Z torch.manual_seed(2025) 2025-05-07T20:32:42.6838734Z 2025-05-07T20:32:42.6838911Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6838988Z 2025-05-07T20:32:42.6839084Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6839212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6839311Z x = x_sign * x_clamp 2025-05-07T20:32:42.6839392Z x0 = x[:, :D] 2025-05-07T20:32:42.6839473Z x1 = x[:, D:] 2025-05-07T20:32:42.6839553Z 2025-05-07T20:32:42.6839641Z if contiguous: 2025-05-07T20:32:42.6839744Z x0 = x0.contiguous() 2025-05-07T20:32:42.6839839Z x1 = x1.contiguous() 2025-05-07T20:32:42.6839915Z 2025-05-07T20:32:42.6840006Z if scale_ub is not None: 2025-05-07T20:32:42.6840116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6840256Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6840332Z ) 2025-05-07T20:32:42.6840458Z else: 2025-05-07T20:32:42.6840552Z scale_ub_tensor = None 2025-05-07T20:32:42.6840622Z 2025-05-07T20:32:42.6840757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6840846Z op = silu_mul_quant 2025-05-07T20:32:42.6840932Z if compiled: 2025-05-07T20:32:42.6841073Z op = torch.compile(op) 2025-05-07T20:32:42.6841177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6841254Z 2025-05-07T20:32:42.6841345Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6841355Z 2025-05-07T20:32:42.6841453Z moe/activation_test.py:117: 2025-05-07T20:32:42.6841588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6841691Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6841793Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6842193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6842292Z return fn(*args, **kwargs) 
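Since Hypothesis only logs the examples it happened to try, a failing combination such as T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True is worth pinning with the @example decorator once the architecture issue is resolved, so it re-runs deterministically on every invocation instead of depending on random search. A self-contained sketch of the pattern:

from hypothesis import example, given, settings
from hypothesis import strategies as st


@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=4096)  # pin a previously failing shape so it runs every time
@settings(deadline=None)
def test_shape(T: int) -> None:
    assert T > 0
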
2025-05-07T20:32:42.6842833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6842931Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6843312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6843552Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6843921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6844021Z kernel = self.compile( 2025-05-07T20:32:42.6844433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6844615Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6844754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6844758Z 2025-05-07T20:32:42.6844973Z self = 2025-05-07T20:32:42.6845823Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6846480Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4119700>} 2025-05-07T20:32:42.6847292Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6847498Z context = 2025-05-07T20:32:42.6847505Z 2025-05-07T20:32:42.6847680Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6847964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6848073Z module_map=module_map) 2025-05-07T20:32:42.6848243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6848349Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6848430Z E ^ 2025-05-07T20:32:42.6848824Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6848833Z 2025-05-07T20:32:42.6849282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6849286Z 2025-05-07T20:32:42.6849391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6849629Z self=, 2025-05-07T20:32:42.6849752Z T=2048, 2025-05-07T20:32:42.6849830Z D=7168, 2025-05-07T20:32:42.6849919Z scale_ub=1200.0, 2025-05-07T20:32:42.6850007Z contiguous=False, 2025-05-07T20:32:42.6850096Z compiled=False, 2025-05-07T20:32:42.6850176Z ) 2025-05-07T20:32:42.6850406Z self = 2025-05-07T20:32:42.6850633Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6850638Z 2025-05-07T20:32:42.6850720Z @given( 2025-05-07T20:32:42.6850850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6850957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6851074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6851193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6851312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6851392Z ) 2025-05-07T20:32:42.6851657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6851756Z def test_silu_mul_quant( 2025-05-07T20:32:42.6851836Z self, 2025-05-07T20:32:42.6851919Z T: int, 2025-05-07T20:32:42.6851998Z D: int, 2025-05-07T20:32:42.6852101Z scale_ub: Optional[float], 2025-05-07T20:32:42.6852199Z contiguous: bool, 2025-05-07T20:32:42.6852287Z compiled: bool, 2025-05-07T20:32:42.6852365Z ) -> None: 2025-05-07T20:32:42.6852467Z torch.manual_seed(2025) 2025-05-07T20:32:42.6852550Z 2025-05-07T20:32:42.6852726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6852808Z 2025-05-07T20:32:42.6852902Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6853029Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6853124Z x = x_sign * x_clamp 2025-05-07T20:32:42.6853206Z x0 = x[:, :D] 2025-05-07T20:32:42.6853295Z x1 = x[:, D:] 2025-05-07T20:32:42.6853370Z 2025-05-07T20:32:42.6853455Z if contiguous: 2025-05-07T20:32:42.6853550Z x0 = x0.contiguous() 2025-05-07T20:32:42.6853642Z x1 = x1.contiguous() 2025-05-07T20:32:42.6853719Z 2025-05-07T20:32:42.6853818Z if scale_ub is not None: 2025-05-07T20:32:42.6853931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6854072Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6854151Z ) 2025-05-07T20:32:42.6854229Z else: 2025-05-07T20:32:42.6854410Z scale_ub_tensor = None 2025-05-07T20:32:42.6854491Z 2025-05-07T20:32:42.6854623Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6854711Z op = silu_mul_quant 2025-05-07T20:32:42.6854799Z if compiled: 2025-05-07T20:32:42.6854897Z op = torch.compile(op) 2025-05-07T20:32:42.6855010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6855087Z 2025-05-07T20:32:42.6855181Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6855185Z 2025-05-07T20:32:42.6855289Z moe/activation_test.py:117: 2025-05-07T20:32:42.6855425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6855529Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6855634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6856177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6856281Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6856663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6856894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6857259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6857394Z kernel = self.compile( 2025-05-07T20:32:42.6857804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6857989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6858124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6858169Z 2025-05-07T20:32:42.6858385Z self = 2025-05-07T20:32:42.6859240Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6859791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd409a790>} 2025-05-07T20:32:42.6860606Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6860802Z context = 2025-05-07T20:32:42.6860809Z 2025-05-07T20:32:42.6860982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6861261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6861374Z module_map=module_map) 2025-05-07T20:32:42.6861538Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6861637Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6861718Z E ^ 2025-05-07T20:32:42.6862097Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6862105Z 2025-05-07T20:32:42.6862548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6862553Z 2025-05-07T20:32:42.6862656Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6862887Z self=, 2025-05-07T20:32:42.6862978Z T=1, 2025-05-07T20:32:42.6863060Z D=7168, 2025-05-07T20:32:42.6863143Z scale_ub=None, 2025-05-07T20:32:42.6863233Z contiguous=True, 2025-05-07T20:32:42.6863401Z compiled=False, 2025-05-07T20:32:42.6863481Z ) 2025-05-07T20:32:42.6863710Z self = 2025-05-07T20:32:42.6863878Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6863883Z 2025-05-07T20:32:42.6863962Z @given( 2025-05-07T20:32:42.6864088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6864193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6864316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6864434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6864550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6864630Z ) 2025-05-07T20:32:42.6864888Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6864988Z def test_silu_mul_quant( 2025-05-07T20:32:42.6865074Z self, 2025-05-07T20:32:42.6865155Z T: int, 2025-05-07T20:32:42.6865244Z D: int, 2025-05-07T20:32:42.6865350Z scale_ub: Optional[float], 2025-05-07T20:32:42.6865441Z contiguous: bool, 2025-05-07T20:32:42.6865529Z compiled: bool, 2025-05-07T20:32:42.6865614Z ) -> None: 2025-05-07T20:32:42.6865711Z torch.manual_seed(2025) 2025-05-07T20:32:42.6865789Z 2025-05-07T20:32:42.6865967Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6866082Z 2025-05-07T20:32:42.6866177Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6866299Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6866393Z x = x_sign * x_clamp 2025-05-07T20:32:42.6866477Z x0 = x[:, :D] 2025-05-07T20:32:42.6866559Z x1 = x[:, D:] 2025-05-07T20:32:42.6866694Z 2025-05-07T20:32:42.6866796Z if contiguous: 2025-05-07T20:32:42.6866904Z x0 = x0.contiguous() 2025-05-07T20:32:42.6866995Z x1 = x1.contiguous() 2025-05-07T20:32:42.6867082Z 2025-05-07T20:32:42.6867176Z if scale_ub is not None: 2025-05-07T20:32:42.6867288Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6867425Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6867504Z ) 2025-05-07T20:32:42.6867587Z else: 2025-05-07T20:32:42.6867682Z scale_ub_tensor = None 2025-05-07T20:32:42.6867761Z 2025-05-07T20:32:42.6867898Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6867993Z op = silu_mul_quant 2025-05-07T20:32:42.6868079Z if compiled: 2025-05-07T20:32:42.6868185Z op = torch.compile(op) 2025-05-07T20:32:42.6868292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6868370Z 2025-05-07T20:32:42.6868465Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6868470Z 2025-05-07T20:32:42.6868568Z moe/activation_test.py:117: 2025-05-07T20:32:42.6868711Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6868814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6868916Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6869465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6869566Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6870071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6870315Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6870682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6870783Z kernel = self.compile( 2025-05-07T20:32:42.6871196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6871461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6871598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6871602Z 2025-05-07T20:32:42.6871815Z self = 2025-05-07T20:32:42.6872664Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6873215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd40d40d0>} 2025-05-07T20:32:42.6874030Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6874233Z context = 2025-05-07T20:32:42.6874237Z 2025-05-07T20:32:42.6874406Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6874685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6874795Z module_map=module_map) 2025-05-07T20:32:42.6875000Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6875104Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6875184Z E ^ 2025-05-07T20:32:42.6875569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6875617Z 2025-05-07T20:32:42.6876065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6876070Z 2025-05-07T20:32:42.6876175Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6876411Z self=, 2025-05-07T20:32:42.6876493Z T=16384, 2025-05-07T20:32:42.6876573Z D=7168, 2025-05-07T20:32:42.6876663Z scale_ub=1200.0, 2025-05-07T20:32:42.6876750Z contiguous=False, 2025-05-07T20:32:42.6876837Z compiled=True, 2025-05-07T20:32:42.6876916Z ) 2025-05-07T20:32:42.6877146Z self = 2025-05-07T20:32:42.6877337Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6877341Z 2025-05-07T20:32:42.6877423Z @given( 2025-05-07T20:32:42.6877544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6877649Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6877771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6877889Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6878012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6878089Z ) 2025-05-07T20:32:42.6878353Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6878449Z def test_silu_mul_quant( 2025-05-07T20:32:42.6878530Z self, 2025-05-07T20:32:42.6878614Z T: int, 2025-05-07T20:32:42.6878695Z D: int, 2025-05-07T20:32:42.6878797Z scale_ub: Optional[float], 2025-05-07T20:32:42.6878894Z contiguous: bool, 2025-05-07T20:32:42.6878981Z compiled: bool, 2025-05-07T20:32:42.6879062Z ) -> None: 2025-05-07T20:32:42.6879160Z torch.manual_seed(2025) 2025-05-07T20:32:42.6879238Z 2025-05-07T20:32:42.6879415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6879497Z 2025-05-07T20:32:42.6879592Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6879721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6879923Z x = x_sign * x_clamp 2025-05-07T20:32:42.6880006Z x0 = x[:, :D] 2025-05-07T20:32:42.6880085Z x1 = x[:, D:] 2025-05-07T20:32:42.6880157Z 2025-05-07T20:32:42.6880243Z if contiguous: 2025-05-07T20:32:42.6880340Z x0 = x0.contiguous() 2025-05-07T20:32:42.6880431Z x1 = x1.contiguous() 2025-05-07T20:32:42.6880508Z 2025-05-07T20:32:42.6880608Z if scale_ub is not None: 2025-05-07T20:32:42.6880723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6880864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6880948Z ) 2025-05-07T20:32:42.6881028Z else: 2025-05-07T20:32:42.6881123Z scale_ub_tensor = None 2025-05-07T20:32:42.6881202Z 2025-05-07T20:32:42.6881336Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6881431Z op = silu_mul_quant 2025-05-07T20:32:42.6881521Z if compiled: 2025-05-07T20:32:42.6881628Z op = torch.compile(op) 2025-05-07T20:32:42.6881741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6881814Z 2025-05-07T20:32:42.6881908Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6881913Z 2025-05-07T20:32:42.6882016Z moe/activation_test.py:117: 2025-05-07T20:32:42.6882149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6882252Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6882403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6882988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6883132Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6883688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6883884Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6884274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6884508Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6884874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6884974Z kernel = self.compile( 2025-05-07T20:32:42.6885384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6885572Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6885705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6885710Z 2025-05-07T20:32:42.6885924Z self = 2025-05-07T20:32:42.6886779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6887332Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd40d4d30>} 2025-05-07T20:32:42.6888149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6888349Z context = 2025-05-07T20:32:42.6888353Z 2025-05-07T20:32:42.6888526Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6888804Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6888911Z module_map=module_map) 2025-05-07T20:32:42.6889195Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6889299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6889384Z E ^ 2025-05-07T20:32:42.6889773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6889778Z 2025-05-07T20:32:42.6890229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6890237Z 2025-05-07T20:32:42.6890344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6894456Z self=, 2025-05-07T20:32:42.6894556Z T=1, 2025-05-07T20:32:42.6894634Z D=7168, 2025-05-07T20:32:42.6894716Z scale_ub=None, 2025-05-07T20:32:42.6894813Z contiguous=False, 2025-05-07T20:32:42.6894897Z compiled=False, 2025-05-07T20:32:42.6894970Z ) 2025-05-07T20:32:42.6895212Z self = 2025-05-07T20:32:42.6895390Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6895395Z 2025-05-07T20:32:42.6895472Z @given( 2025-05-07T20:32:42.6895596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6895698Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6895815Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6896028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6896140Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6896219Z ) 2025-05-07T20:32:42.6896485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6896587Z def test_silu_mul_quant( 2025-05-07T20:32:42.6896708Z self, 2025-05-07T20:32:42.6896786Z T: int, 2025-05-07T20:32:42.6896867Z D: int, 2025-05-07T20:32:42.6896970Z scale_ub: Optional[float], 2025-05-07T20:32:42.6897066Z contiguous: bool, 2025-05-07T20:32:42.6897155Z compiled: bool, 2025-05-07T20:32:42.6897237Z ) -> None: 2025-05-07T20:32:42.6897336Z torch.manual_seed(2025) 2025-05-07T20:32:42.6897418Z 2025-05-07T20:32:42.6897593Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6897667Z 2025-05-07T20:32:42.6897764Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6897892Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6897986Z x = x_sign * x_clamp 2025-05-07T20:32:42.6898073Z x0 = x[:, :D] 2025-05-07T20:32:42.6898150Z x1 = x[:, D:] 2025-05-07T20:32:42.6898226Z 2025-05-07T20:32:42.6898309Z if contiguous: 2025-05-07T20:32:42.6898402Z x0 = x0.contiguous() 2025-05-07T20:32:42.6898500Z x1 = x1.contiguous() 2025-05-07T20:32:42.6898570Z 2025-05-07T20:32:42.6898657Z if scale_ub is not None: 2025-05-07T20:32:42.6898767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6898906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6898982Z ) 2025-05-07T20:32:42.6899063Z else: 2025-05-07T20:32:42.6899156Z scale_ub_tensor = None 2025-05-07T20:32:42.6899229Z 2025-05-07T20:32:42.6899363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6899458Z op = silu_mul_quant 2025-05-07T20:32:42.6899551Z if compiled: 2025-05-07T20:32:42.6899654Z op = torch.compile(op) 2025-05-07T20:32:42.6899761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6899841Z 2025-05-07T20:32:42.6899938Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6899943Z 2025-05-07T20:32:42.6900041Z moe/activation_test.py:117: 2025-05-07T20:32:42.6900181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6900290Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6900475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6901031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6901137Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6901526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6901762Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6902128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6902227Z kernel = self.compile( 2025-05-07T20:32:42.6902642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6902824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6902965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6902970Z 2025-05-07T20:32:42.6903181Z self = 2025-05-07T20:32:42.6904031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6904624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd4151700>} 2025-05-07T20:32:42.6905445Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6905682Z context = 2025-05-07T20:32:42.6905687Z 2025-05-07T20:32:42.6905864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6906153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6906263Z module_map=module_map) 2025-05-07T20:32:42.6906436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6906561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6906655Z E ^ 2025-05-07T20:32:42.6907064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6907069Z 2025-05-07T20:32:42.6907520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6907529Z 2025-05-07T20:32:42.6907630Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6907866Z self=, 2025-05-07T20:32:42.6907952Z T=2048, 2025-05-07T20:32:42.6908033Z D=7168, 2025-05-07T20:32:42.6908118Z scale_ub=None, 2025-05-07T20:32:42.6908207Z contiguous=False, 2025-05-07T20:32:42.6908295Z compiled=True, 2025-05-07T20:32:42.6908373Z ) 2025-05-07T20:32:42.6908598Z self = 2025-05-07T20:32:42.6908782Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6908789Z 2025-05-07T20:32:42.6908868Z @given( 2025-05-07T20:32:42.6908990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6909098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6909216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6909344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6909465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6909542Z ) 2025-05-07T20:32:42.6909969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6910069Z def test_silu_mul_quant( 2025-05-07T20:32:42.6910150Z self, 2025-05-07T20:32:42.6910234Z T: int, 2025-05-07T20:32:42.6910312Z D: int, 2025-05-07T20:32:42.6910413Z scale_ub: Optional[float], 2025-05-07T20:32:42.6910507Z contiguous: bool, 2025-05-07T20:32:42.6910594Z compiled: bool, 2025-05-07T20:32:42.6910679Z ) -> None: 2025-05-07T20:32:42.6910777Z torch.manual_seed(2025) 2025-05-07T20:32:42.6910852Z 2025-05-07T20:32:42.6911036Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6911115Z 2025-05-07T20:32:42.6911210Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6911342Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6911436Z x = x_sign * x_clamp 2025-05-07T20:32:42.6911519Z x0 = x[:, :D] 2025-05-07T20:32:42.6911606Z x1 = x[:, D:] 2025-05-07T20:32:42.6911679Z 2025-05-07T20:32:42.6911768Z if contiguous: 2025-05-07T20:32:42.6911866Z x0 = x0.contiguous() 2025-05-07T20:32:42.6911958Z x1 = x1.contiguous() 2025-05-07T20:32:42.6912036Z 2025-05-07T20:32:42.6912131Z if scale_ub is not None: 2025-05-07T20:32:42.6912241Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6912379Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6912501Z ) 2025-05-07T20:32:42.6912575Z else: 2025-05-07T20:32:42.6912670Z scale_ub_tensor = None 2025-05-07T20:32:42.6912743Z 2025-05-07T20:32:42.6912872Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6912967Z op = silu_mul_quant 2025-05-07T20:32:42.6913050Z if compiled: 2025-05-07T20:32:42.6913218Z op = torch.compile(op) 2025-05-07T20:32:42.6913326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6913396Z 2025-05-07T20:32:42.6913490Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6913494Z 2025-05-07T20:32:42.6913593Z moe/activation_test.py:117: 2025-05-07T20:32:42.6913725Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6913831Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6913928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6914320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6914417Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6914953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6915052Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6915445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6915685Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6916052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6916148Z kernel = self.compile( 2025-05-07T20:32:42.6916561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6916744Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6916882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6916886Z 2025-05-07T20:32:42.6917109Z self = 2025-05-07T20:32:42.6917957Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6918589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd41de3a0>} 2025-05-07T20:32:42.6919414Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6919613Z context = 2025-05-07T20:32:42.6919620Z 2025-05-07T20:32:42.6919794Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6920072Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6920182Z module_map=module_map) 2025-05-07T20:32:42.6920349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6920446Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6920526Z E ^ 2025-05-07T20:32:42.6920919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6920924Z 2025-05-07T20:32:42.6921373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6921377Z 2025-05-07T20:32:42.6921487Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6921757Z self=, 2025-05-07T20:32:42.6921839Z T=4096, 2025-05-07T20:32:42.6921918Z D=7168, 2025-05-07T20:32:42.6922004Z scale_ub=None, 2025-05-07T20:32:42.6922095Z contiguous=False, 2025-05-07T20:32:42.6922182Z compiled=True, 2025-05-07T20:32:42.6922258Z ) 2025-05-07T20:32:42.6922527Z self = 2025-05-07T20:32:42.6922706Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6922710Z 2025-05-07T20:32:42.6922789Z @given( 2025-05-07T20:32:42.6922911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6923008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6923122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6923240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6923355Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6923434Z ) 2025-05-07T20:32:42.6923693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6923786Z def test_silu_mul_quant( 2025-05-07T20:32:42.6923864Z self, 2025-05-07T20:32:42.6923944Z T: int, 2025-05-07T20:32:42.6924024Z D: int, 2025-05-07T20:32:42.6924131Z scale_ub: Optional[float], 2025-05-07T20:32:42.6924220Z contiguous: bool, 2025-05-07T20:32:42.6924306Z compiled: bool, 2025-05-07T20:32:42.6924389Z ) -> None: 2025-05-07T20:32:42.6924490Z torch.manual_seed(2025) 2025-05-07T20:32:42.6924564Z 2025-05-07T20:32:42.6924741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6924819Z 2025-05-07T20:32:42.6924913Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6925042Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6925132Z x = x_sign * x_clamp 2025-05-07T20:32:42.6925219Z x0 = x[:, :D] 2025-05-07T20:32:42.6925304Z x1 = x[:, D:] 2025-05-07T20:32:42.6925376Z 2025-05-07T20:32:42.6925468Z if contiguous: 2025-05-07T20:32:42.6925562Z x0 = x0.contiguous() 2025-05-07T20:32:42.6925653Z x1 = x1.contiguous() 2025-05-07T20:32:42.6925735Z 2025-05-07T20:32:42.6925831Z if scale_ub is not None: 2025-05-07T20:32:42.6925942Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6926087Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6926163Z ) 2025-05-07T20:32:42.6926323Z else: 2025-05-07T20:32:42.6926426Z scale_ub_tensor = None 2025-05-07T20:32:42.6926499Z 2025-05-07T20:32:42.6926632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6926722Z op = silu_mul_quant 2025-05-07T20:32:42.6926806Z if compiled: 2025-05-07T20:32:42.6926912Z op = torch.compile(op) 2025-05-07T20:32:42.6927020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6927096Z 2025-05-07T20:32:42.6927189Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6927193Z 2025-05-07T20:32:42.6927292Z moe/activation_test.py:117: 2025-05-07T20:32:42.6927425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6927530Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6927633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6928034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6928129Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6928667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6928765Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6929150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6929423Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6929788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6929880Z kernel = self.compile( 2025-05-07T20:32:42.6930290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6930509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6930643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6930647Z 2025-05-07T20:32:42.6930863Z self = 2025-05-07T20:32:42.6931712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6932269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd41de700>} 2025-05-07T20:32:42.6933081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6933285Z context = 2025-05-07T20:32:42.6933293Z 2025-05-07T20:32:42.6933462Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6933740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6933853Z module_map=module_map) 2025-05-07T20:32:42.6934019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6934122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6934204Z E ^ 2025-05-07T20:32:42.6934588Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6934592Z 2025-05-07T20:32:42.6935045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6935053Z 2025-05-07T20:32:42.6935158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6935472Z self=, 2025-05-07T20:32:42.6935552Z T=16384, 2025-05-07T20:32:42.6935628Z D=5120, 2025-05-07T20:32:42.6935715Z scale_ub=1200.0, 2025-05-07T20:32:42.6935806Z contiguous=False, 2025-05-07T20:32:42.6935895Z compiled=False, 2025-05-07T20:32:42.6935970Z ) 2025-05-07T20:32:42.6936202Z self = 2025-05-07T20:32:42.6936393Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6936398Z 2025-05-07T20:32:42.6936479Z @given( 2025-05-07T20:32:42.6936603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6936701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6936822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6936944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6937059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6937141Z ) 2025-05-07T20:32:42.6937408Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6937507Z def test_silu_mul_quant( 2025-05-07T20:32:42.6937583Z self, 2025-05-07T20:32:42.6937662Z T: int, 2025-05-07T20:32:42.6937744Z D: int, 2025-05-07T20:32:42.6937845Z scale_ub: Optional[float], 2025-05-07T20:32:42.6937936Z contiguous: bool, 2025-05-07T20:32:42.6938065Z compiled: bool, 2025-05-07T20:32:42.6938145Z ) -> None: 2025-05-07T20:32:42.6938242Z torch.manual_seed(2025) 2025-05-07T20:32:42.6938323Z 2025-05-07T20:32:42.6938499Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6938578Z 2025-05-07T20:32:42.6938676Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6938840Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6938934Z x = x_sign * x_clamp 2025-05-07T20:32:42.6939012Z x0 = x[:, :D] 2025-05-07T20:32:42.6939097Z x1 = x[:, D:] 2025-05-07T20:32:42.6939174Z 2025-05-07T20:32:42.6939258Z if contiguous: 2025-05-07T20:32:42.6939352Z x0 = x0.contiguous() 2025-05-07T20:32:42.6939444Z x1 = x1.contiguous() 2025-05-07T20:32:42.6939519Z 2025-05-07T20:32:42.6939609Z if scale_ub is not None: 2025-05-07T20:32:42.6939715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6939851Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6939927Z ) 2025-05-07T20:32:42.6940006Z else: 2025-05-07T20:32:42.6940102Z scale_ub_tensor = None 2025-05-07T20:32:42.6940179Z 2025-05-07T20:32:42.6940315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6940408Z op = silu_mul_quant 2025-05-07T20:32:42.6940499Z if compiled: 2025-05-07T20:32:42.6940602Z op = torch.compile(op) 2025-05-07T20:32:42.6940710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6940792Z 2025-05-07T20:32:42.6940885Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6940889Z 2025-05-07T20:32:42.6940988Z moe/activation_test.py:117: 2025-05-07T20:32:42.6941131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6941235Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6941336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6941883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.6941985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6942377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6942617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6943063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6943161Z kernel = self.compile( 2025-05-07T20:32:42.6943573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6943755Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6943885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6943892Z 2025-05-07T20:32:42.6944103Z self = 2025-05-07T20:32:42.6944955Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6945506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3e83790>} 2025-05-07T20:32:42.6946325Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6946521Z context = 2025-05-07T20:32:42.6946526Z 2025-05-07T20:32:42.6946694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6947038Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6947147Z module_map=module_map) 2025-05-07T20:32:42.6947311Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6947447Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6947527Z E ^ 2025-05-07T20:32:42.6947917Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6947926Z 2025-05-07T20:32:42.6948374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6948379Z 2025-05-07T20:32:42.6948485Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6948719Z self=, 2025-05-07T20:32:42.6948801Z T=16384, 2025-05-07T20:32:42.6948888Z D=5120, 2025-05-07T20:32:42.6948973Z scale_ub=1200.0, 2025-05-07T20:32:42.6949059Z contiguous=True, 2025-05-07T20:32:42.6949146Z compiled=True, 2025-05-07T20:32:42.6949219Z ) 2025-05-07T20:32:42.6949449Z self = 2025-05-07T20:32:42.6949636Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6949644Z 2025-05-07T20:32:42.6949770Z @given( 2025-05-07T20:32:42.6949895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6950003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6950121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6950242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6950356Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6950435Z ) 2025-05-07T20:32:42.6950701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6950798Z def test_silu_mul_quant( 2025-05-07T20:32:42.6950874Z self, 2025-05-07T20:32:42.6950957Z T: int, 2025-05-07T20:32:42.6951034Z D: int, 2025-05-07T20:32:42.6951134Z scale_ub: Optional[float], 2025-05-07T20:32:42.6951228Z contiguous: bool, 2025-05-07T20:32:42.6951315Z compiled: bool, 2025-05-07T20:32:42.6951399Z ) -> None: 2025-05-07T20:32:42.6951495Z torch.manual_seed(2025) 2025-05-07T20:32:42.6951573Z 2025-05-07T20:32:42.6951833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6951911Z 2025-05-07T20:32:42.6952000Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6952130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6952221Z x = x_sign * x_clamp 2025-05-07T20:32:42.6952303Z x0 = x[:, :D] 2025-05-07T20:32:42.6952391Z x1 = x[:, D:] 2025-05-07T20:32:42.6952467Z 2025-05-07T20:32:42.6952554Z if contiguous: 2025-05-07T20:32:42.6952655Z x0 = x0.contiguous() 2025-05-07T20:32:42.6952749Z x1 = x1.contiguous() 2025-05-07T20:32:42.6952828Z 2025-05-07T20:32:42.6952924Z if scale_ub is not None: 2025-05-07T20:32:42.6953032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6953175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6953256Z ) 2025-05-07T20:32:42.6953333Z else: 2025-05-07T20:32:42.6953430Z scale_ub_tensor = None 2025-05-07T20:32:42.6953517Z 2025-05-07T20:32:42.6953657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6953747Z op = silu_mul_quant 2025-05-07T20:32:42.6953839Z if compiled: 2025-05-07T20:32:42.6953945Z op = torch.compile(op) 2025-05-07T20:32:42.6954056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6954129Z 2025-05-07T20:32:42.6954222Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6954268Z 2025-05-07T20:32:42.6954369Z moe/activation_test.py:117: 2025-05-07T20:32:42.6954501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6954604Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6954708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6955100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6955232Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6955773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6955871Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6956254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6956487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6956854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6956951Z kernel = self.compile( 2025-05-07T20:32:42.6957363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6957547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6957685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6957689Z 2025-05-07T20:32:42.6957907Z self = 2025-05-07T20:32:42.6958757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6959311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d91550>} 2025-05-07T20:32:42.6960133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6960334Z context = 2025-05-07T20:32:42.6960338Z 2025-05-07T20:32:42.6960510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6960874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6960981Z module_map=module_map) 2025-05-07T20:32:42.6961145Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6961243Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6961321Z E ^ 2025-05-07T20:32:42.6961710Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6961717Z 2025-05-07T20:32:42.6962166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6962170Z 2025-05-07T20:32:42.6962278Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6962515Z self=, 2025-05-07T20:32:42.6962597Z T=16384, 2025-05-07T20:32:42.6962680Z D=5120, 2025-05-07T20:32:42.6962770Z scale_ub=None, 2025-05-07T20:32:42.6962858Z contiguous=False, 2025-05-07T20:32:42.6962945Z compiled=True, 2025-05-07T20:32:42.6963021Z ) 2025-05-07T20:32:42.6963250Z self = 2025-05-07T20:32:42.6963439Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6963443Z 2025-05-07T20:32:42.6963561Z @given( 2025-05-07T20:32:42.6963681Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6963779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6963894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6964013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6964125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6964242Z ) 2025-05-07T20:32:42.6964507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6964605Z def test_silu_mul_quant( 2025-05-07T20:32:42.6964684Z self, 2025-05-07T20:32:42.6964765Z T: int, 2025-05-07T20:32:42.6964844Z D: int, 2025-05-07T20:32:42.6964945Z scale_ub: Optional[float], 2025-05-07T20:32:42.6965042Z contiguous: bool, 2025-05-07T20:32:42.6965130Z compiled: bool, 2025-05-07T20:32:42.6965212Z ) -> None: 2025-05-07T20:32:42.6965309Z torch.manual_seed(2025) 2025-05-07T20:32:42.6965389Z 2025-05-07T20:32:42.6965568Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6965647Z 2025-05-07T20:32:42.6965742Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6965871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6965962Z x = x_sign * x_clamp 2025-05-07T20:32:42.6966047Z x0 = x[:, :D] 2025-05-07T20:32:42.6966132Z x1 = x[:, D:] 2025-05-07T20:32:42.6966203Z 2025-05-07T20:32:42.6966288Z if contiguous: 2025-05-07T20:32:42.6966388Z x0 = x0.contiguous() 2025-05-07T20:32:42.6966480Z x1 = x1.contiguous() 2025-05-07T20:32:42.6966560Z 2025-05-07T20:32:42.6966657Z if scale_ub is not None: 2025-05-07T20:32:42.6966764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6966904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6966982Z ) 2025-05-07T20:32:42.6967058Z else: 2025-05-07T20:32:42.6967157Z scale_ub_tensor = None 2025-05-07T20:32:42.6967233Z 2025-05-07T20:32:42.6967364Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6967459Z op = silu_mul_quant 2025-05-07T20:32:42.6967545Z if compiled: 2025-05-07T20:32:42.6967647Z op = torch.compile(op) 2025-05-07T20:32:42.6967762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6967838Z 2025-05-07T20:32:42.6967931Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6967938Z 2025-05-07T20:32:42.6968120Z moe/activation_test.py:117: 2025-05-07T20:32:42.6968256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6968357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6968456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6968849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6968950Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6969485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6969590Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6969975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6970215Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6970588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6970683Z kernel = self.compile( 2025-05-07T20:32:42.6971095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6971282Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6971415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6971460Z 2025-05-07T20:32:42.6971679Z self = 2025-05-07T20:32:42.6972525Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6973119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d910d0>} 2025-05-07T20:32:42.6973937Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6974135Z context = 2025-05-07T20:32:42.6974143Z 2025-05-07T20:32:42.6974314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6974592Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6974699Z module_map=module_map) 2025-05-07T20:32:42.6974863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6974965Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6975046Z E ^ 2025-05-07T20:32:42.6975432Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6975436Z 2025-05-07T20:32:42.6975881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6975886Z 2025-05-07T20:32:42.6975989Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6976220Z self=, 2025-05-07T20:32:42.6976304Z T=2048, 2025-05-07T20:32:42.6976385Z D=5120, 2025-05-07T20:32:42.6976470Z scale_ub=None, 2025-05-07T20:32:42.6976560Z contiguous=False, 2025-05-07T20:32:42.6976646Z compiled=True, 2025-05-07T20:32:42.6976723Z ) 2025-05-07T20:32:42.6976956Z self = 2025-05-07T20:32:42.6977140Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6977144Z 2025-05-07T20:32:42.6977225Z @given( 2025-05-07T20:32:42.6977460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6977561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6977680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6977794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6977906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6977985Z ) 2025-05-07T20:32:42.6978242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6978335Z def test_silu_mul_quant( 2025-05-07T20:32:42.6978413Z self, 2025-05-07T20:32:42.6978492Z T: int, 2025-05-07T20:32:42.6978572Z D: int, 2025-05-07T20:32:42.6978676Z scale_ub: Optional[float], 2025-05-07T20:32:42.6978767Z contiguous: bool, 2025-05-07T20:32:42.6978858Z compiled: bool, 2025-05-07T20:32:42.6978943Z ) -> None: 2025-05-07T20:32:42.6979039Z torch.manual_seed(2025) 2025-05-07T20:32:42.6979119Z 2025-05-07T20:32:42.6979299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6979379Z 2025-05-07T20:32:42.6979476Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6979603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6979696Z x = x_sign * x_clamp 2025-05-07T20:32:42.6979781Z x0 = x[:, :D] 2025-05-07T20:32:42.6979866Z x1 = x[:, D:] 2025-05-07T20:32:42.6979982Z 2025-05-07T20:32:42.6980068Z if contiguous: 2025-05-07T20:32:42.6980161Z x0 = x0.contiguous() 2025-05-07T20:32:42.6980249Z x1 = x1.contiguous() 2025-05-07T20:32:42.6980326Z 2025-05-07T20:32:42.6980420Z if scale_ub is not None: 2025-05-07T20:32:42.6980526Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6980706Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6980779Z ) 2025-05-07T20:32:42.6980858Z else: 2025-05-07T20:32:42.6980961Z scale_ub_tensor = None 2025-05-07T20:32:42.6981033Z 2025-05-07T20:32:42.6981167Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6981258Z op = silu_mul_quant 2025-05-07T20:32:42.6981345Z if compiled: 2025-05-07T20:32:42.6981451Z op = torch.compile(op) 2025-05-07T20:32:42.6981559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6981631Z 2025-05-07T20:32:42.6981729Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6981734Z 2025-05-07T20:32:42.6981834Z moe/activation_test.py:117: 2025-05-07T20:32:42.6981972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6982077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6982179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6982576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6982667Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6983485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6983595Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6983983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6984220Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6984587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6984679Z kernel = self.compile( 2025-05-07T20:32:42.6985093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6985276Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6985408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6985558Z 2025-05-07T20:32:42.6985794Z self = 2025-05-07T20:32:42.6986784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6987424Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3d14af0>} 2025-05-07T20:32:42.6988366Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6988588Z context = 2025-05-07T20:32:42.6988593Z 2025-05-07T20:32:42.6988783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6989095Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6989210Z module_map=module_map) 2025-05-07T20:32:42.6989389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6989495Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6989633Z E ^ 2025-05-07T20:32:42.6990081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6990086Z 2025-05-07T20:32:42.6990536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6990602Z 2025-05-07T20:32:42.6990706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6990939Z self=, 2025-05-07T20:32:42.6991024Z T=2048, 2025-05-07T20:32:42.6991109Z D=5120, 2025-05-07T20:32:42.6991198Z scale_ub=1200.0, 2025-05-07T20:32:42.6991291Z contiguous=False, 2025-05-07T20:32:42.6991380Z compiled=True, 2025-05-07T20:32:42.6991468Z ) 2025-05-07T20:32:42.6991697Z self = 2025-05-07T20:32:42.6991883Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6991890Z 2025-05-07T20:32:42.6991970Z @given( 2025-05-07T20:32:42.6992092Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6992193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6992315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6992436Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6992559Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6992639Z ) 2025-05-07T20:32:42.6992903Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6993007Z def test_silu_mul_quant( 2025-05-07T20:32:42.6993088Z self, 2025-05-07T20:32:42.6993169Z T: int, 2025-05-07T20:32:42.6993251Z D: int, 2025-05-07T20:32:42.6993354Z scale_ub: Optional[float], 2025-05-07T20:32:42.6993446Z contiguous: bool, 2025-05-07T20:32:42.6993538Z compiled: bool, 2025-05-07T20:32:42.6993619Z ) -> None: 2025-05-07T20:32:42.6993720Z torch.manual_seed(2025) 2025-05-07T20:32:42.6993800Z 2025-05-07T20:32:42.6993975Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6994056Z 2025-05-07T20:32:42.6994151Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6994278Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6994377Z x = x_sign * x_clamp 2025-05-07T20:32:42.6994460Z x0 = x[:, :D] 2025-05-07T20:32:42.6994544Z x1 = x[:, D:] 2025-05-07T20:32:42.6994622Z 2025-05-07T20:32:42.6994793Z if contiguous: 2025-05-07T20:32:42.6994885Z x0 = x0.contiguous() 2025-05-07T20:32:42.6994977Z x1 = x1.contiguous() 2025-05-07T20:32:42.6995048Z 2025-05-07T20:32:42.6995135Z if scale_ub is not None: 2025-05-07T20:32:42.6995247Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6995385Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6995466Z ) 2025-05-07T20:32:42.6995550Z else: 2025-05-07T20:32:42.6995648Z scale_ub_tensor = None 2025-05-07T20:32:42.6995727Z 2025-05-07T20:32:42.6995859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6995952Z op = silu_mul_quant 2025-05-07T20:32:42.6996040Z if compiled: 2025-05-07T20:32:42.6996146Z op = torch.compile(op) 2025-05-07T20:32:42.6996254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6996341Z 2025-05-07T20:32:42.6996436Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6996445Z 2025-05-07T20:32:42.6996543Z moe/activation_test.py:117: 2025-05-07T20:32:42.6996682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6996786Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6996890Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6997283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6997419Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6997961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6998062Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6998483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6998720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6999091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6999190Z kernel = self.compile( 2025-05-07T20:32:42.6999603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6999785Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6999925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6999929Z 2025-05-07T20:32:42.7000143Z self = 2025-05-07T20:32:42.7000998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7001563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3adf820>} 2025-05-07T20:32:42.7002382Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7002585Z context = 2025-05-07T20:32:42.7002592Z 2025-05-07T20:32:42.7002761Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7003041Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7003148Z module_map=module_map) 2025-05-07T20:32:42.7003315Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7003416Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7003495Z E ^ 2025-05-07T20:32:42.7003962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7003968Z 2025-05-07T20:32:42.7004416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7004420Z 2025-05-07T20:32:42.7004522Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7004757Z self=, 2025-05-07T20:32:42.7004839Z T=4096, 2025-05-07T20:32:42.7004923Z D=5120, 2025-05-07T20:32:42.7005012Z scale_ub=1200.0, 2025-05-07T20:32:42.7005101Z contiguous=True, 2025-05-07T20:32:42.7005190Z compiled=True, 2025-05-07T20:32:42.7005265Z ) 2025-05-07T20:32:42.7005495Z self = 2025-05-07T20:32:42.7005682Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7005686Z 2025-05-07T20:32:42.7005775Z @given( 2025-05-07T20:32:42.7005895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7006002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7006119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7006240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7006360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7006484Z ) 2025-05-07T20:32:42.7006748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7006846Z def test_silu_mul_quant( 2025-05-07T20:32:42.7006925Z self, 2025-05-07T20:32:42.7007009Z T: int, 2025-05-07T20:32:42.7007090Z D: int, 2025-05-07T20:32:42.7007190Z scale_ub: Optional[float], 2025-05-07T20:32:42.7007321Z contiguous: bool, 2025-05-07T20:32:42.7007407Z compiled: bool, 2025-05-07T20:32:42.7007485Z ) -> None: 2025-05-07T20:32:42.7007583Z torch.manual_seed(2025) 2025-05-07T20:32:42.7007663Z 2025-05-07T20:32:42.7007839Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7007923Z 2025-05-07T20:32:42.7008020Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7008154Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7008247Z x = x_sign * x_clamp 2025-05-07T20:32:42.7008332Z x0 = x[:, :D] 2025-05-07T20:32:42.7008422Z x1 = x[:, D:] 2025-05-07T20:32:42.7008499Z 2025-05-07T20:32:42.7008587Z if contiguous: 2025-05-07T20:32:42.7008684Z x0 = x0.contiguous() 2025-05-07T20:32:42.7008776Z x1 = x1.contiguous() 2025-05-07T20:32:42.7008855Z 2025-05-07T20:32:42.7008951Z if scale_ub is not None: 2025-05-07T20:32:42.7009063Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7009203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7009288Z ) 2025-05-07T20:32:42.7009368Z else: 2025-05-07T20:32:42.7009468Z scale_ub_tensor = None 2025-05-07T20:32:42.7009549Z 2025-05-07T20:32:42.7009682Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7009779Z op = silu_mul_quant 2025-05-07T20:32:42.7009867Z if compiled: 2025-05-07T20:32:42.7009971Z op = torch.compile(op) 2025-05-07T20:32:42.7010083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7010166Z 2025-05-07T20:32:42.7010262Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7010266Z 2025-05-07T20:32:42.7010370Z moe/activation_test.py:117: 2025-05-07T20:32:42.7010505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7010607Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7010714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7011108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7011311Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.7011854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.7011953Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:42.7012347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.7012588Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.7012957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.7013053Z     kernel = self.compile(
2025-05-07T20:32:42.7013466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.7013655Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.7013794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.7013799Z 
2025-05-07T20:32:42.7014016Z self = 
2025-05-07T20:32:42.7014871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.7015463Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3bf2430>}
2025-05-07T20:32:42.7016281Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:42.7016518Z context = 
2025-05-07T20:32:42.7016522Z 
2025-05-07T20:32:42.7016698Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.7016973Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.7017086Z                           module_map=module_map)
2025-05-07T20:32:42.7017255Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.7017361Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:42.7017443Z E       ^
2025-05-07T20:32:42.7017831Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.7017836Z 
2025-05-07T20:32:42.7018285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:42.7018293Z 
2025-05-07T20:32:42.7018402Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.7018642Z     self=,
2025-05-07T20:32:42.7018723Z     T=128,
2025-05-07T20:32:42.7018808Z     D=5120,
2025-05-07T20:32:42.7018895Z     scale_ub=1200.0,
2025-05-07T20:32:42.7018987Z     contiguous=False,
2025-05-07T20:32:42.7019079Z     compiled=True,
2025-05-07T20:32:42.7019157Z )
2025-05-07T20:32:42.7019390Z self = 
2025-05-07T20:32:42.7019574Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:42.7019582Z 
2025-05-07T20:32:42.7019662Z     @given(
2025-05-07T20:32:42.7019787Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:42.7023755Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:42.7023903Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:42.7024034Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:42.7024154Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:42.7024235Z     )
2025-05-07T20:32:42.7024603Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:42.7024706Z     def test_silu_mul_quant(
2025-05-07T20:32:42.7024786Z         self,
2025-05-07T20:32:42.7024866Z         T: int,
2025-05-07T20:32:42.7024948Z         D: int,
2025-05-07T20:32:42.7025052Z         scale_ub: Optional[float],
2025-05-07T20:32:42.7025152Z         contiguous: bool,
2025-05-07T20:32:42.7025245Z         compiled: bool,
2025-05-07T20:32:42.7025335Z     ) -> None:
2025-05-07T20:32:42.7025434Z         torch.manual_seed(2025)
2025-05-07T20:32:42.7025515Z 
2025-05-07T20:32:42.7025698Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:42.7025778Z 
2025-05-07T20:32:42.7025877Z         x_sign = torch.sign(x)
2025-05-07T20:32:42.7026009Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.7026104Z         x = x_sign * x_clamp
2025-05-07T20:32:42.7026196Z         x0 = x[:, :D]
2025-05-07T20:32:42.7026278Z         x1 = x[:, D:]
2025-05-07T20:32:42.7026371Z 
2025-05-07T20:32:42.7026469Z         if contiguous:
2025-05-07T20:32:42.7026580Z             x0 = x0.contiguous()
2025-05-07T20:32:42.7026690Z             x1 = x1.contiguous()
2025-05-07T20:32:42.7026776Z 
2025-05-07T20:32:42.7026869Z         if scale_ub is not None:
2025-05-07T20:32:42.7026984Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:42.7027127Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:42.7027253Z             )
2025-05-07T20:32:42.7027335Z         else:
2025-05-07T20:32:42.7027431Z             scale_ub_tensor = None
2025-05-07T20:32:42.7027507Z 
2025-05-07T20:32:42.7027643Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.7027736Z             op = silu_mul_quant
2025-05-07T20:32:42.7027867Z             if compiled:
2025-05-07T20:32:42.7027968Z                 op = torch.compile(op)
2025-05-07T20:32:42.7028075Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.7028151Z 
2025-05-07T20:32:42.7028249Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:42.7028254Z 
2025-05-07T20:32:42.7028353Z moe/activation_test.py:117:
2025-05-07T20:32:42.7028493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.7028594Z moe/activation_test.py:115: in fn
2025-05-07T20:32:42.7028697Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.7029105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:42.7029208Z     return fn(*args, **kwargs)
2025-05-07T20:32:42.7029845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.7029948Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:42.7030348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.7030595Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.7030966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.7031063Z     kernel = self.compile(
2025-05-07T20:32:42.7031482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.7031668Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.7031808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:42.7031813Z 
2025-05-07T20:32:42.7032032Z self = 
2025-05-07T20:32:42.7032890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.7033536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3a36040>}
2025-05-07T20:32:42.7034357Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:42.7034571Z context = 
2025-05-07T20:32:42.7034575Z 
2025-05-07T20:32:42.7034751Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.7035040Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.7035154Z                           module_map=module_map)
2025-05-07T20:32:42.7035324Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.7035429Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:42.7035513Z E       ^
2025-05-07T20:32:42.7035900Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
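[editor's note] The root cause of this CompilationError is the hardware, not the test inputs: Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) is only lowered on NVIDIA GPUs with compute capability 8.9 or newer, and the error's supported list ('fp8e4b15', 'fp8e5') is exactly what Triton offers on older parts, so the runner's GPU must be below SM 8.9. A minimal guard sketch follows; torch is assumed importable and the helper name supports_fp8e4nv is hypothetical, not FBGEMM API:

import torch
import unittest

def supports_fp8e4nv() -> bool:
    # fp8e4nv (= torch.float8_e4m3fn) needs compute capability >= 8.9;
    # earlier GPUs only expose Triton's fp8e5 / fp8e4b15 encodings,
    # matching the ValueError in the log above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# e.g. applied to the failing test:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")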
2025-05-07T20:32:42.7036357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:42.7036469Z Trying example: test_silu_mul_quant( self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True )
2025-05-07T20:32:42.7050386Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False )
2025-05-07T20:32:42.7065081Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False )
2025-05-07T20:32:42.7078642Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:42.7092897Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:42.7106859Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True )
[each example above fails with the identical traceback and error shown for the first example: triton.compiler.errors.CompilationError / ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100; the repeated test-source listings and tracebacks are elided]
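[editor's note] Every drawn example fails at the same point, make_ir (the AST-to-TTIR step), before any tensor data is touched, so no combination of T, D, scale_ub, contiguous, or compiled can change the outcome; Hypothesis just replays the same compile failure. A sketch of failing fast at collection time instead, building on the capability check above (pytest assumed; the marker name requires_sm89 is hypothetical):

import pytest
import torch

requires_sm89 = pytest.mark.skipif(
    not (torch.cuda.is_available()
         and torch.cuda.get_device_capability() >= (8, 9)),
    reason="fp8e4nv (float8_e4m3fn) is unsupported below SM 8.9",
)

# @requires_sm89
# def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...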
[repeated test-source listings elided below; each drawn example, its failing statement, and its error are kept]
2025-05-07T20:32:42.7120878Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False )
2025-05-07T20:32:42.7124561Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.7126564Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7126698Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:42.7126811Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True )
2025-05-07T20:32:42.7130427Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.7132408Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7132576Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:42.7132685Z Trying example: test_silu_mul_quant( self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False )
2025-05-07T20:32:42.7135920Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:42.7137905Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7138114Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:42.7138227Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True )
2025-05-07T20:32:42.7144568Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.7146549Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7146675Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:42.7146797Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False )
2025-05-07T20:32:42.7150474Z >       x_sign = torch.sign(x)
2025-05-07T20:32:42.7152435Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:42.7152652Z moe/activation_test.py:94: OutOfMemoryError
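[editor's note] The "Tried to allocate" sizes above are exactly one [T, 2*D] bfloat16 tensor at 2 bytes per element, and the test materializes several tensors of that size per example (randn, abs/clamp, sign, the product). With roughly 21.9 GiB already held by the process from earlier examples, even a 56 MiB request fails. A quick check of the arithmetic (plain Python, no GPU needed):

def bf16_mib(T: int, D: int) -> float:
    # One [T, 2*D] bfloat16 tensor: 2 bytes per element.
    return T * (2 * D) * 2 / 2**20

assert bf16_mib(16384, 7168) == 448.0  # the failed torch.randn above
assert bf16_mib(16384, 5120) == 320.0  # the failed torch.clamp
assert bf16_mib(4096, 7168) == 112.0
assert bf16_mib(2048, 7168) == 56.0

The message's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, addresses allocator fragmentation; it would not by itself stop memory from accumulating across examples.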
2025-05-07T20:32:42.7152801Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False )
2025-05-07T20:32:42.7170190Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False )
[both examples above fail with the same CompilationError as before: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100; repeated listings and tracebacks elided]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7183247Z 2025-05-07T20:32:42.7183699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7183704Z 2025-05-07T20:32:42.7183810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7184042Z self=, 2025-05-07T20:32:42.7184122Z T=128, 2025-05-07T20:32:42.7184213Z D=7168, 2025-05-07T20:32:42.7184300Z scale_ub=None, 2025-05-07T20:32:42.7184396Z contiguous=True, 2025-05-07T20:32:42.7184485Z compiled=False, 2025-05-07T20:32:42.7184562Z ) 2025-05-07T20:32:42.7184795Z self = 2025-05-07T20:32:42.7184975Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7184980Z 2025-05-07T20:32:42.7185070Z @given( 2025-05-07T20:32:42.7185200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7185304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7185426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7185550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7185665Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7185745Z ) 2025-05-07T20:32:42.7186006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7186103Z def test_silu_mul_quant( 2025-05-07T20:32:42.7186188Z self, 2025-05-07T20:32:42.7186267Z T: int, 2025-05-07T20:32:42.7186347Z D: int, 2025-05-07T20:32:42.7186451Z scale_ub: Optional[float], 2025-05-07T20:32:42.7186565Z contiguous: bool, 2025-05-07T20:32:42.7186658Z compiled: bool, 2025-05-07T20:32:42.7186761Z ) -> None: 2025-05-07T20:32:42.7186858Z torch.manual_seed(2025) 2025-05-07T20:32:42.7186936Z 2025-05-07T20:32:42.7187114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7187192Z 2025-05-07T20:32:42.7187384Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7187513Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7187602Z x = x_sign * x_clamp 2025-05-07T20:32:42.7187686Z x0 = x[:, :D] 2025-05-07T20:32:42.7187768Z x1 = x[:, D:] 2025-05-07T20:32:42.7187840Z 2025-05-07T20:32:42.7187927Z if contiguous: 2025-05-07T20:32:42.7188021Z x0 = x0.contiguous() 2025-05-07T20:32:42.7188110Z x1 = x1.contiguous() 2025-05-07T20:32:42.7188187Z 2025-05-07T20:32:42.7188278Z if scale_ub is not None: 2025-05-07T20:32:42.7188381Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7188518Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7188590Z ) 2025-05-07T20:32:42.7188671Z else: 2025-05-07T20:32:42.7188770Z scale_ub_tensor = None 2025-05-07T20:32:42.7188847Z 2025-05-07T20:32:42.7188982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7189077Z op = silu_mul_quant 2025-05-07T20:32:42.7189162Z if compiled: 2025-05-07T20:32:42.7189268Z op = torch.compile(op) 2025-05-07T20:32:42.7189374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7189449Z 2025-05-07T20:32:42.7189547Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7189622Z 2025-05-07T20:32:42.7189776Z moe/activation_test.py:117: 2025-05-07T20:32:42.7189968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7190078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7190180Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7190727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7190885Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7191270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7191511Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7191879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7191978Z kernel = self.compile( 2025-05-07T20:32:42.7192398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7192582Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7192720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7192724Z 2025-05-07T20:32:42.7192941Z self = 2025-05-07T20:32:42.7193794Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7194343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd36f0550>} 2025-05-07T20:32:42.7195160Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7195364Z context = 2025-05-07T20:32:42.7195368Z 2025-05-07T20:32:42.7195541Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7195827Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7195939Z module_map=module_map) 2025-05-07T20:32:42.7196106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7196253Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7196333Z E ^ 2025-05-07T20:32:42.7196715Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7196720Z 2025-05-07T20:32:42.7197174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7197181Z 2025-05-07T20:32:42.7197282Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7197518Z self=, 2025-05-07T20:32:42.7197599Z T=2048, 2025-05-07T20:32:42.7197678Z D=7168, 2025-05-07T20:32:42.7197767Z scale_ub=1200.0, 2025-05-07T20:32:42.7197855Z contiguous=True, 2025-05-07T20:32:42.7197942Z compiled=False, 2025-05-07T20:32:42.7198023Z ) 2025-05-07T20:32:42.7198254Z self = 2025-05-07T20:32:42.7198442Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7198451Z 2025-05-07T20:32:42.7198532Z @given( 2025-05-07T20:32:42.7198654Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7198760Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7198926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7199047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7199203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7199277Z ) 2025-05-07T20:32:42.7199536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7199636Z def test_silu_mul_quant( 2025-05-07T20:32:42.7199716Z self, 2025-05-07T20:32:42.7199837Z T: int, 2025-05-07T20:32:42.7199915Z D: int, 2025-05-07T20:32:42.7200013Z scale_ub: Optional[float], 2025-05-07T20:32:42.7200105Z contiguous: bool, 2025-05-07T20:32:42.7200196Z compiled: bool, 2025-05-07T20:32:42.7200273Z ) -> None: 2025-05-07T20:32:42.7200370Z torch.manual_seed(2025) 2025-05-07T20:32:42.7200444Z 2025-05-07T20:32:42.7200618Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7202599Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
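The repeated CompilationError above is Triton rejecting the fp8e4nv element type on this runner's GPU: fp8e4nv (float8_e4m3fn) generally requires SM 8.9 (Ada) or newer, while the A10G in a g5.4xlarge reports SM 8.6, where only fp8e4b15 and fp8e5 are available. Below is a minimal sketch of a capability guard that would skip such tests on unsupported hardware; the (8, 9) threshold and the helper name are assumptions, not part of activation_test.py.

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton fp8e4nv kernels are assumed to need
        # compute capability >= (8, 9); the A10G in this log reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test shown in this log:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...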
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7202610Z 2025-05-07T20:32:42.7202726Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7202731Z 2025-05-07T20:32:42.7202840Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7203072Z self=, 2025-05-07T20:32:42.7203156Z T=1, 2025-05-07T20:32:42.7203237Z D=5120, 2025-05-07T20:32:42.7203324Z scale_ub=1200.0, 2025-05-07T20:32:42.7203417Z contiguous=True, 2025-05-07T20:32:42.7203505Z compiled=False, 2025-05-07T20:32:42.7203584Z ) 2025-05-07T20:32:42.7203819Z self = 2025-05-07T20:32:42.7203991Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7203995Z 2025-05-07T20:32:42.7204074Z @given( 2025-05-07T20:32:42.7204197Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7204303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7204422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7204543Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7204705Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7204787Z ) 2025-05-07T20:32:42.7205046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7205137Z def test_silu_mul_quant( 2025-05-07T20:32:42.7205220Z self, 2025-05-07T20:32:42.7205305Z T: int, 2025-05-07T20:32:42.7205386Z D: int, 2025-05-07T20:32:42.7205496Z scale_ub: Optional[float], 2025-05-07T20:32:42.7205586Z contiguous: bool, 2025-05-07T20:32:42.7205672Z compiled: bool, 2025-05-07T20:32:42.7205758Z ) -> None: 2025-05-07T20:32:42.7205857Z torch.manual_seed(2025) 2025-05-07T20:32:42.7205933Z 2025-05-07T20:32:42.7206111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7206196Z 2025-05-07T20:32:42.7206294Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7206420Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7206516Z x = x_sign * x_clamp 2025-05-07T20:32:42.7206607Z x0 = x[:, :D] 2025-05-07T20:32:42.7206687Z x1 = x[:, D:] 2025-05-07T20:32:42.7206763Z 2025-05-07T20:32:42.7206854Z if contiguous: 2025-05-07T20:32:42.7206947Z x0 = x0.contiguous() 2025-05-07T20:32:42.7207037Z x1 = x1.contiguous() 2025-05-07T20:32:42.7207161Z 2025-05-07T20:32:42.7207252Z if scale_ub is not None: 2025-05-07T20:32:42.7207394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7207536Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7207614Z ) 2025-05-07T20:32:42.7207693Z else: 2025-05-07T20:32:42.7207789Z scale_ub_tensor = None 2025-05-07T20:32:42.7207867Z 2025-05-07T20:32:42.7208040Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7208130Z op = silu_mul_quant 2025-05-07T20:32:42.7208214Z if compiled: 2025-05-07T20:32:42.7208320Z op = torch.compile(op) 2025-05-07T20:32:42.7208426Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7208496Z 2025-05-07T20:32:42.7208593Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7208598Z 2025-05-07T20:32:42.7208693Z moe/activation_test.py:117: 2025-05-07T20:32:42.7208827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7208933Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7209037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7209586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7209685Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7210075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7210319Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7210691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7210792Z kernel = self.compile( 2025-05-07T20:32:42.7211208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7211392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7211531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7211535Z 2025-05-07T20:32:42.7211750Z self = 2025-05-07T20:32:42.7212604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7213201Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd37d5280>} 2025-05-07T20:32:42.7214019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7214218Z context = 2025-05-07T20:32:42.7214225Z 2025-05-07T20:32:42.7214399Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7214680Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7214790Z module_map=module_map) 2025-05-07T20:32:42.7214959Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7215064Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7215140Z E ^ 2025-05-07T20:32:42.7215526Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7215536Z 2025-05-07T20:32:42.7215981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7215985Z 2025-05-07T20:32:42.7216155Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7216390Z self=, 2025-05-07T20:32:42.7216509Z T=2048, 2025-05-07T20:32:42.7216586Z D=5120, 2025-05-07T20:32:42.7216671Z scale_ub=None, 2025-05-07T20:32:42.7216756Z contiguous=True, 2025-05-07T20:32:42.7216839Z compiled=False, 2025-05-07T20:32:42.7216918Z ) 2025-05-07T20:32:42.7217145Z self = 2025-05-07T20:32:42.7217374Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7217378Z 2025-05-07T20:32:42.7217452Z @given( 2025-05-07T20:32:42.7217574Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7217677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7217794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7217911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7218031Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7218106Z ) 2025-05-07T20:32:42.7218364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7218466Z def test_silu_mul_quant( 2025-05-07T20:32:42.7218545Z self, 2025-05-07T20:32:42.7218629Z T: int, 2025-05-07T20:32:42.7218711Z D: int, 2025-05-07T20:32:42.7218813Z scale_ub: Optional[float], 2025-05-07T20:32:42.7218918Z contiguous: bool, 2025-05-07T20:32:42.7219004Z compiled: bool, 2025-05-07T20:32:42.7219085Z ) -> None: 2025-05-07T20:32:42.7219188Z torch.manual_seed(2025) 2025-05-07T20:32:42.7219263Z 2025-05-07T20:32:42.7219439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7219522Z 2025-05-07T20:32:42.7219615Z > x_sign = torch.sign(x) 2025-05-07T20:32:42.7221587Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7221599Z 2025-05-07T20:32:42.7221716Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:42.7221721Z 2025-05-07T20:32:42.7221820Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7222101Z self=, 2025-05-07T20:32:42.7222188Z T=16384, 2025-05-07T20:32:42.7222270Z D=5120, 2025-05-07T20:32:42.7222362Z scale_ub=None, 2025-05-07T20:32:42.7222448Z contiguous=True, 2025-05-07T20:32:42.7222531Z compiled=False, 2025-05-07T20:32:42.7222617Z ) 2025-05-07T20:32:42.7222847Z self = 2025-05-07T20:32:42.7223032Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7223037Z 2025-05-07T20:32:42.7223116Z @given( 2025-05-07T20:32:42.7223234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7223336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7223456Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7223573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7223690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7223764Z ) 2025-05-07T20:32:42.7224022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7224120Z def test_silu_mul_quant( 2025-05-07T20:32:42.7224197Z self, 2025-05-07T20:32:42.7224275Z T: int, 2025-05-07T20:32:42.7224408Z D: int, 2025-05-07T20:32:42.7224507Z scale_ub: Optional[float], 2025-05-07T20:32:42.7224638Z contiguous: bool, 2025-05-07T20:32:42.7224725Z compiled: bool, 2025-05-07T20:32:42.7224806Z ) -> None: 2025-05-07T20:32:42.7224906Z torch.manual_seed(2025) 2025-05-07T20:32:42.7224982Z 2025-05-07T20:32:42.7225158Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7227163Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7227169Z 2025-05-07T20:32:42.7227284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7227291Z 2025-05-07T20:32:42.7227395Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7227626Z self=, 2025-05-07T20:32:42.7227704Z T=4096, 2025-05-07T20:32:42.7227785Z D=5120, 2025-05-07T20:32:42.7227870Z scale_ub=None, 2025-05-07T20:32:42.7227965Z contiguous=True, 2025-05-07T20:32:42.7228049Z compiled=False, 2025-05-07T20:32:42.7228126Z ) 2025-05-07T20:32:42.7228357Z self = 2025-05-07T20:32:42.7228537Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7228542Z 2025-05-07T20:32:42.7228621Z @given( 2025-05-07T20:32:42.7228746Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7228849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7228967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7229089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7229205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7229286Z ) 2025-05-07T20:32:42.7229545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7229640Z def test_silu_mul_quant( 2025-05-07T20:32:42.7229831Z self, 2025-05-07T20:32:42.7229914Z T: int, 2025-05-07T20:32:42.7229993Z D: int, 2025-05-07T20:32:42.7230096Z scale_ub: Optional[float], 2025-05-07T20:32:42.7230186Z contiguous: bool, 2025-05-07T20:32:42.7230319Z compiled: bool, 2025-05-07T20:32:42.7230399Z ) -> None: 2025-05-07T20:32:42.7230493Z torch.manual_seed(2025) 2025-05-07T20:32:42.7230564Z 2025-05-07T20:32:42.7230740Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7232686Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7232696Z 2025-05-07T20:32:42.7232819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7232826Z 2025-05-07T20:32:42.7232931Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7233169Z self=, 2025-05-07T20:32:42.7233248Z T=2048, 2025-05-07T20:32:42.7233327Z D=5120, 2025-05-07T20:32:42.7233415Z scale_ub=None, 2025-05-07T20:32:42.7233554Z contiguous=False, 2025-05-07T20:32:42.7233640Z compiled=False, 2025-05-07T20:32:42.7233754Z ) 2025-05-07T20:32:42.7233979Z self = 2025-05-07T20:32:42.7234158Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.7234163Z 2025-05-07T20:32:42.7234242Z @given( 2025-05-07T20:32:42.7234363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7234506Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7234618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7234733Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7234849Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7234925Z ) 2025-05-07T20:32:42.7235182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7235279Z def test_silu_mul_quant( 2025-05-07T20:32:42.7235358Z self, 2025-05-07T20:32:42.7235440Z T: int, 2025-05-07T20:32:42.7235520Z D: int, 2025-05-07T20:32:42.7235622Z scale_ub: Optional[float], 2025-05-07T20:32:42.7235716Z contiguous: bool, 2025-05-07T20:32:42.7235802Z compiled: bool, 2025-05-07T20:32:42.7235883Z ) -> None: 2025-05-07T20:32:42.7235983Z torch.manual_seed(2025) 2025-05-07T20:32:42.7236061Z 2025-05-07T20:32:42.7236236Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7238192Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
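Every OOM above carries the allocator's own suggestion: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True when "reserved but unallocated" memory is large. A minimal sketch of applying it follows, assuming the variable is read when CUDA is first initialized, so it must be set before the process touches the GPU.

    import os

    # Must be set before the first CUDA allocation in the process,
    # e.g. at the top of conftest.py or exported in the job environment.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    # Dropping cached blocks between examples can also help when, as here,
    # a 22.07 GiB GPU is already 22.04 GiB full before the tensor is made.
    torch.cuda.empty_cache()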
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7238201Z 2025-05-07T20:32:42.7238316Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7238320Z 2025-05-07T20:32:42.7238425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7238652Z self=, 2025-05-07T20:32:42.7238726Z T=4096, 2025-05-07T20:32:42.7238808Z D=7168, 2025-05-07T20:32:42.7238895Z scale_ub=None, 2025-05-07T20:32:42.7238983Z contiguous=True, 2025-05-07T20:32:42.7239074Z compiled=True, 2025-05-07T20:32:42.7239152Z ) 2025-05-07T20:32:42.7239430Z self = 2025-05-07T20:32:42.7239606Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.7239611Z 2025-05-07T20:32:42.7239686Z @given( 2025-05-07T20:32:42.7239810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7239915Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7240034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7240156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7240270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7240349Z ) 2025-05-07T20:32:42.7240612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7240707Z def test_silu_mul_quant( 2025-05-07T20:32:42.7240788Z self, 2025-05-07T20:32:42.7240868Z T: int, 2025-05-07T20:32:42.7240947Z D: int, 2025-05-07T20:32:42.7241052Z scale_ub: Optional[float], 2025-05-07T20:32:42.7241142Z contiguous: bool, 2025-05-07T20:32:42.7241231Z compiled: bool, 2025-05-07T20:32:42.7241318Z ) -> None: 2025-05-07T20:32:42.7241414Z torch.manual_seed(2025) 2025-05-07T20:32:42.7241490Z 2025-05-07T20:32:42.7241715Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7243666Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7243744Z 2025-05-07T20:32:42.7243876Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7243881Z 2025-05-07T20:32:42.7243989Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7244251Z self=, 2025-05-07T20:32:42.7244330Z T=2048, 2025-05-07T20:32:42.7244408Z D=5120, 2025-05-07T20:32:42.7244501Z scale_ub=1200.0, 2025-05-07T20:32:42.7244590Z contiguous=False, 2025-05-07T20:32:42.7244679Z compiled=False, 2025-05-07T20:32:42.7244761Z ) 2025-05-07T20:32:42.7245012Z self = 2025-05-07T20:32:42.7245213Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.7245217Z 2025-05-07T20:32:42.7245303Z @given( 2025-05-07T20:32:42.7245428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7245536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7245662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7245785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7245906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7245984Z ) 2025-05-07T20:32:42.7246277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7246383Z def test_silu_mul_quant( 2025-05-07T20:32:42.7246461Z self, 2025-05-07T20:32:42.7246542Z T: int, 2025-05-07T20:32:42.7246627Z D: int, 2025-05-07T20:32:42.7246730Z scale_ub: Optional[float], 2025-05-07T20:32:42.7246825Z contiguous: bool, 2025-05-07T20:32:42.7246914Z compiled: bool, 2025-05-07T20:32:42.7246995Z ) -> None: 2025-05-07T20:32:42.7247097Z torch.manual_seed(2025) 2025-05-07T20:32:42.7247177Z 2025-05-07T20:32:42.7247363Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7249745Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7249755Z 2025-05-07T20:32:42.7249874Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7249879Z 2025-05-07T20:32:42.7249981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7250211Z self=, 2025-05-07T20:32:42.7250290Z T=4096, 2025-05-07T20:32:42.7250376Z D=7168, 2025-05-07T20:32:42.7250460Z scale_ub=1200.0, 2025-05-07T20:32:42.7250551Z contiguous=True, 2025-05-07T20:32:42.7250640Z compiled=False, 2025-05-07T20:32:42.7250716Z ) 2025-05-07T20:32:42.7250951Z self = 2025-05-07T20:32:42.7251132Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7251137Z 2025-05-07T20:32:42.7251212Z @given( 2025-05-07T20:32:42.7251381Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7251517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7251631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7251748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7251860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7251935Z ) 2025-05-07T20:32:42.7252245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7252340Z def test_silu_mul_quant( 2025-05-07T20:32:42.7252423Z self, 2025-05-07T20:32:42.7252501Z T: int, 2025-05-07T20:32:42.7252585Z D: int, 2025-05-07T20:32:42.7252690Z scale_ub: Optional[float], 2025-05-07T20:32:42.7252781Z contiguous: bool, 2025-05-07T20:32:42.7252867Z compiled: bool, 2025-05-07T20:32:42.7252949Z ) -> None: 2025-05-07T20:32:42.7253045Z torch.manual_seed(2025) 2025-05-07T20:32:42.7253124Z 2025-05-07T20:32:42.7253302Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7255258Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7255266Z 2025-05-07T20:32:42.7255389Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7255393Z 2025-05-07T20:32:42.7255495Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7255731Z self=, 2025-05-07T20:32:42.7255815Z T=16384, 2025-05-07T20:32:42.7255894Z D=7168, 2025-05-07T20:32:42.7255983Z scale_ub=None, 2025-05-07T20:32:42.7256069Z contiguous=False, 2025-05-07T20:32:42.7256154Z compiled=True, 2025-05-07T20:32:42.7256233Z ) 2025-05-07T20:32:42.7256460Z self = 2025-05-07T20:32:42.7256646Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.7256656Z 2025-05-07T20:32:42.7256735Z @given( 2025-05-07T20:32:42.7256855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7257003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7257119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7257237Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7257352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7257426Z ) 2025-05-07T20:32:42.7257687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7257793Z def test_silu_mul_quant( 2025-05-07T20:32:42.7257872Z self, 2025-05-07T20:32:42.7257953Z T: int, 2025-05-07T20:32:42.7258034Z D: int, 2025-05-07T20:32:42.7258136Z scale_ub: Optional[float], 2025-05-07T20:32:42.7258229Z contiguous: bool, 2025-05-07T20:32:42.7258318Z compiled: bool, 2025-05-07T20:32:42.7258399Z ) -> None: 2025-05-07T20:32:42.7258499Z torch.manual_seed(2025) 2025-05-07T20:32:42.7258574Z 2025-05-07T20:32:42.7258748Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7260746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
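The 448.00 MiB in the failed allocations for T=16384, D=7168 matches the input tensor the test builds, which confirms these OOMs happen in test setup (torch.randn) rather than in the kernel under test. A quick check, illustrative only:

    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) at T=16384, D=7168:
    T, D = 16384, 7168
    bytes_per_elem = 2  # bfloat16
    print(T * (2 * D) * bytes_per_elem / 1024**2)  # 448.0 (MiB)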
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7260787Z 2025-05-07T20:32:42.7260903Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7260907Z 2025-05-07T20:32:42.7261015Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7261283Z self=, 2025-05-07T20:32:42.7261357Z T=4096, 2025-05-07T20:32:42.7261440Z D=7168, 2025-05-07T20:32:42.7261528Z scale_ub=None, 2025-05-07T20:32:42.7261624Z contiguous=True, 2025-05-07T20:32:42.7261711Z compiled=False, 2025-05-07T20:32:42.7261788Z ) 2025-05-07T20:32:42.7262020Z self = 2025-05-07T20:32:42.7262196Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7262203Z 2025-05-07T20:32:42.7262283Z @given( 2025-05-07T20:32:42.7262412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7262514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7262629Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7262751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7262867Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7262952Z ) 2025-05-07T20:32:42.7263215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7263311Z def test_silu_mul_quant( 2025-05-07T20:32:42.7263399Z self, 2025-05-07T20:32:42.7263478Z T: int, 2025-05-07T20:32:42.7263556Z D: int, 2025-05-07T20:32:42.7263660Z scale_ub: Optional[float], 2025-05-07T20:32:42.7263751Z contiguous: bool, 2025-05-07T20:32:42.7263836Z compiled: bool, 2025-05-07T20:32:42.7263920Z ) -> None: 2025-05-07T20:32:42.7264018Z torch.manual_seed(2025) 2025-05-07T20:32:42.7264097Z 2025-05-07T20:32:42.7264272Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7266270Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7266279Z 2025-05-07T20:32:42.7266403Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7266408Z 2025-05-07T20:32:42.7266511Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7266751Z self=, 2025-05-07T20:32:42.7266833Z T=16384, 2025-05-07T20:32:42.7266911Z D=7168, 2025-05-07T20:32:42.7267001Z scale_ub=None, 2025-05-07T20:32:42.7267086Z contiguous=True, 2025-05-07T20:32:42.7267175Z compiled=False, 2025-05-07T20:32:42.7267250Z ) 2025-05-07T20:32:42.7267479Z self = 2025-05-07T20:32:42.7267662Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.7267669Z 2025-05-07T20:32:42.7267745Z @given( 2025-05-07T20:32:42.7267868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7267973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7268088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7268208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7268322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7268443Z ) 2025-05-07T20:32:42.7268704Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7268837Z def test_silu_mul_quant( 2025-05-07T20:32:42.7268913Z self, 2025-05-07T20:32:42.7268987Z T: int, 2025-05-07T20:32:42.7269068Z D: int, 2025-05-07T20:32:42.7269169Z scale_ub: Optional[float], 2025-05-07T20:32:42.7269262Z contiguous: bool, 2025-05-07T20:32:42.7269392Z compiled: bool, 2025-05-07T20:32:42.7269470Z ) -> None: 2025-05-07T20:32:42.7269575Z torch.manual_seed(2025) 2025-05-07T20:32:42.7269648Z 2025-05-07T20:32:42.7269907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7271865Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7271873Z 2025-05-07T20:32:42.7271992Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7272001Z 2025-05-07T20:32:42.7272108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7272339Z self=, 2025-05-07T20:32:42.7272420Z T=16384, 2025-05-07T20:32:42.7272506Z D=7168, 2025-05-07T20:32:42.7272588Z scale_ub=1200.0, 2025-05-07T20:32:42.7272676Z contiguous=True, 2025-05-07T20:32:42.7272762Z compiled=False, 2025-05-07T20:32:42.7272834Z ) 2025-05-07T20:32:42.7273064Z self = 2025-05-07T20:32:42.7273250Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7273257Z 2025-05-07T20:32:42.7273337Z @given( 2025-05-07T20:32:42.7273459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7273557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7273670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7273790Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7273908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7273988Z ) 2025-05-07T20:32:42.7274294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7274389Z def test_silu_mul_quant( 2025-05-07T20:32:42.7274472Z self, 2025-05-07T20:32:42.7274551Z T: int, 2025-05-07T20:32:42.7274630Z D: int, 2025-05-07T20:32:42.7274735Z scale_ub: Optional[float], 2025-05-07T20:32:42.7274825Z contiguous: bool, 2025-05-07T20:32:42.7274914Z compiled: bool, 2025-05-07T20:32:42.7274997Z ) -> None: 2025-05-07T20:32:42.7275098Z torch.manual_seed(2025) 2025-05-07T20:32:42.7275174Z 2025-05-07T20:32:42.7275350Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7277304Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7277312Z 2025-05-07T20:32:42.7277434Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7277438Z 2025-05-07T20:32:42.7277597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7277865Z self=, 2025-05-07T20:32:42.7277942Z T=128, 2025-05-07T20:32:42.7278017Z D=5120, 2025-05-07T20:32:42.7278101Z scale_ub=1200.0, 2025-05-07T20:32:42.7278185Z contiguous=False, 2025-05-07T20:32:42.7278269Z compiled=False, 2025-05-07T20:32:42.7278347Z ) 2025-05-07T20:32:42.7278637Z self = 2025-05-07T20:32:42.7278816Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.7278824Z 2025-05-07T20:32:42.7278907Z @given( 2025-05-07T20:32:42.7279027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7279130Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7279247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7279366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7279488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7279569Z ) 2025-05-07T20:32:42.7279829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7279930Z def test_silu_mul_quant( 2025-05-07T20:32:42.7280011Z self, 2025-05-07T20:32:42.7280091Z T: int, 2025-05-07T20:32:42.7280174Z D: int, 2025-05-07T20:32:42.7280278Z scale_ub: Optional[float], 2025-05-07T20:32:42.7280378Z contiguous: bool, 2025-05-07T20:32:42.7280470Z compiled: bool, 2025-05-07T20:32:42.7280555Z ) -> None: 2025-05-07T20:32:42.7280661Z torch.manual_seed(2025) 2025-05-07T20:32:42.7280741Z 2025-05-07T20:32:42.7280918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7281000Z 2025-05-07T20:32:42.7281096Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7281228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7281325Z x = x_sign * x_clamp 2025-05-07T20:32:42.7281411Z x0 = x[:, :D] 2025-05-07T20:32:42.7281501Z x1 = x[:, D:] 2025-05-07T20:32:42.7281582Z 2025-05-07T20:32:42.7281669Z if contiguous: 2025-05-07T20:32:42.7281765Z x0 = x0.contiguous() 2025-05-07T20:32:42.7281862Z x1 = x1.contiguous() 2025-05-07T20:32:42.7281938Z 2025-05-07T20:32:42.7282035Z if scale_ub is not None: 2025-05-07T20:32:42.7282148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7282290Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7282374Z ) 2025-05-07T20:32:42.7282503Z else: 2025-05-07T20:32:42.7282601Z scale_ub_tensor = None 2025-05-07T20:32:42.7282678Z 2025-05-07T20:32:42.7282985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7283114Z op = silu_mul_quant 2025-05-07T20:32:42.7283233Z if compiled: 2025-05-07T20:32:42.7283343Z op = torch.compile(op) 2025-05-07T20:32:42.7283455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7283539Z 2025-05-07T20:32:42.7283635Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7283639Z 2025-05-07T20:32:42.7283742Z moe/activation_test.py:117: 2025-05-07T20:32:42.7283881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7283987Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7284101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7284652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7284757Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7285153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7285392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7285855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7286004Z kernel = self.compile( 2025-05-07T20:32:42.7286421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7286614Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7286809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7286813Z 2025-05-07T20:32:42.7287034Z self = 2025-05-07T20:32:42.7287891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7288455Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd34a6940>} 2025-05-07T20:32:42.7293117Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7293328Z context = 2025-05-07T20:32:42.7293338Z 2025-05-07T20:32:42.7293520Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7293806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7293913Z module_map=module_map) 2025-05-07T20:32:42.7294082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7294181Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7294262Z E ^ 2025-05-07T20:32:42.7294651Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7294658Z 2025-05-07T20:32:42.7295104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7295109Z 2025-05-07T20:32:42.7295218Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7295450Z self=, 2025-05-07T20:32:42.7295532Z T=2048, 2025-05-07T20:32:42.7295608Z D=7168, 2025-05-07T20:32:42.7295693Z scale_ub=None, 2025-05-07T20:32:42.7295874Z contiguous=False, 2025-05-07T20:32:42.7295958Z compiled=False, 2025-05-07T20:32:42.7296034Z ) 2025-05-07T20:32:42.7296263Z self = 2025-05-07T20:32:42.7296442Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.7296446Z 2025-05-07T20:32:42.7296525Z @given( 2025-05-07T20:32:42.7296645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7296745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7296860Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7296978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7297091Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7297169Z ) 2025-05-07T20:32:42.7297431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7297523Z def test_silu_mul_quant( 2025-05-07T20:32:42.7297606Z self, 2025-05-07T20:32:42.7297685Z T: int, 2025-05-07T20:32:42.7297766Z D: int, 2025-05-07T20:32:42.7297867Z scale_ub: Optional[float], 2025-05-07T20:32:42.7297953Z contiguous: bool, 2025-05-07T20:32:42.7298036Z compiled: bool, 2025-05-07T20:32:42.7298117Z ) -> None: 2025-05-07T20:32:42.7298270Z torch.manual_seed(2025) 2025-05-07T20:32:42.7298347Z 2025-05-07T20:32:42.7298529Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7300537Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
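Several sampled examples differ only in the contiguous flag, which the test body explains: x0 = x[:, :D] and x1 = x[:, D:] are column slices of a row-major tensor, so they stay non-contiguous views until .contiguous() copies them into packed storage, and the kernel is exercised with both layouts. A standalone illustration of the distinction (not from the test file):

    import torch

    x = torch.randn(4, 8)
    x0 = x[:, :4]  # a view with row stride 8, not packed
    print(x0.is_contiguous())               # False
    print(x0.contiguous().is_contiguous())  # True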
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7300586Z 2025-05-07T20:32:42.7300705Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7300710Z 2025-05-07T20:32:42.7300814Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7301050Z self=, 2025-05-07T20:32:42.7301129Z T=128, 2025-05-07T20:32:42.7301203Z D=7168, 2025-05-07T20:32:42.7301291Z scale_ub=1200.0, 2025-05-07T20:32:42.7301376Z contiguous=True, 2025-05-07T20:32:42.7301459Z compiled=True, 2025-05-07T20:32:42.7301534Z ) 2025-05-07T20:32:42.7301760Z self = 2025-05-07T20:32:42.7301938Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7301946Z 2025-05-07T20:32:42.7302022Z @given( 2025-05-07T20:32:42.7302140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7302247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7302362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7302478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7302597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7302668Z ) 2025-05-07T20:32:42.7302929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7303030Z def test_silu_mul_quant( 2025-05-07T20:32:42.7303112Z self, 2025-05-07T20:32:42.7303198Z T: int, 2025-05-07T20:32:42.7303279Z D: int, 2025-05-07T20:32:42.7303380Z scale_ub: Optional[float], 2025-05-07T20:32:42.7303473Z contiguous: bool, 2025-05-07T20:32:42.7303560Z compiled: bool, 2025-05-07T20:32:42.7303643Z ) -> None: 2025-05-07T20:32:42.7303742Z torch.manual_seed(2025) 2025-05-07T20:32:42.7303818Z 2025-05-07T20:32:42.7304040Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7304122Z 2025-05-07T20:32:42.7304214Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7304337Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7304430Z x = x_sign * x_clamp 2025-05-07T20:32:42.7304511Z x0 = x[:, :D] 2025-05-07T20:32:42.7304593Z x1 = x[:, D:] 2025-05-07T20:32:42.7304668Z 2025-05-07T20:32:42.7304749Z if contiguous: 2025-05-07T20:32:42.7304846Z x0 = x0.contiguous() 2025-05-07T20:32:42.7304934Z x1 = x1.contiguous() 2025-05-07T20:32:42.7305008Z 2025-05-07T20:32:42.7305100Z if scale_ub is not None: 2025-05-07T20:32:42.7305204Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7305343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7305420Z ) 2025-05-07T20:32:42.7305497Z else: 2025-05-07T20:32:42.7305594Z scale_ub_tensor = None 2025-05-07T20:32:42.7305669Z 2025-05-07T20:32:42.7305801Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7305893Z op = silu_mul_quant 2025-05-07T20:32:42.7305977Z if compiled: 2025-05-07T20:32:42.7306075Z op = torch.compile(op) 2025-05-07T20:32:42.7306184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7306301Z 2025-05-07T20:32:42.7306397Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7306442Z 2025-05-07T20:32:42.7306546Z moe/activation_test.py:117: 2025-05-07T20:32:42.7306685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7306793Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7306893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7307294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7307429Z return fn(*args, **kwargs) 2025-05-07T20:32:42.7307970Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7308070Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7308458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7308697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7309067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7309163Z kernel = self.compile( 2025-05-07T20:32:42.7309576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7309883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7310023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7310027Z 2025-05-07T20:32:42.7310247Z self = 2025-05-07T20:32:42.7311109Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7311668Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3dd3498940>} 2025-05-07T20:32:42.7312493Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7312696Z context = 2025-05-07T20:32:42.7312700Z 2025-05-07T20:32:42.7312876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7313202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7313312Z module_map=module_map) 2025-05-07T20:32:42.7313482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7313580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7313659Z E ^ 2025-05-07T20:32:42.7314039Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7314046Z 2025-05-07T20:32:42.7314492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7314496Z 2025-05-07T20:32:42.7314599Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7314836Z self=, 2025-05-07T20:32:42.7314914Z T=128, 2025-05-07T20:32:42.7314998Z D=7168, 2025-05-07T20:32:42.7315084Z scale_ub=1200.0, 2025-05-07T20:32:42.7315173Z contiguous=True, 2025-05-07T20:32:42.7315259Z compiled=False, 2025-05-07T20:32:42.7315334Z ) 2025-05-07T20:32:42.7315565Z self = 2025-05-07T20:32:42.7315787Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.7315792Z 2025-05-07T20:32:42.7315871Z @given( 2025-05-07T20:32:42.7316066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7316164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7316277Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7316394Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7316507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7316626Z ) 2025-05-07T20:32:42.7316885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7316983Z def test_silu_mul_quant( 2025-05-07T20:32:42.7317064Z self, 2025-05-07T20:32:42.7317142Z T: int, 2025-05-07T20:32:42.7317220Z D: int, 2025-05-07T20:32:42.7317323Z scale_ub: Optional[float], 2025-05-07T20:32:42.7317416Z contiguous: bool, 2025-05-07T20:32:42.7317503Z compiled: bool, 2025-05-07T20:32:42.7317590Z ) -> None: 2025-05-07T20:32:42.7317687Z torch.manual_seed(2025) 2025-05-07T20:32:42.7317767Z 2025-05-07T20:32:42.7317949Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7318029Z 2025-05-07T20:32:42.7318130Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7318257Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7320220Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7320233Z 2025-05-07T20:32:42.7320349Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.7320355Z 2025-05-07T20:32:42.7320460Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7320692Z self=, 2025-05-07T20:32:42.7320773Z T=128, 2025-05-07T20:32:42.7320854Z D=5120, 2025-05-07T20:32:42.7320942Z scale_ub=1200.0, 2025-05-07T20:32:42.7321029Z contiguous=True, 2025-05-07T20:32:42.7321116Z compiled=True, 2025-05-07T20:32:42.7321196Z ) 2025-05-07T20:32:42.7321422Z self = 2025-05-07T20:32:42.7321644Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7321649Z 2025-05-07T20:32:42.7321726Z @given( 2025-05-07T20:32:42.7321843Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7321944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7322059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7322173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7322292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7322365Z ) 2025-05-07T20:32:42.7322624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7322727Z def test_silu_mul_quant( 2025-05-07T20:32:42.7322807Z self, 2025-05-07T20:32:42.7322890Z T: int, 2025-05-07T20:32:42.7322969Z D: int, 2025-05-07T20:32:42.7323069Z scale_ub: Optional[float], 2025-05-07T20:32:42.7323161Z contiguous: bool, 2025-05-07T20:32:42.7323249Z compiled: bool, 2025-05-07T20:32:42.7323330Z ) -> None: 2025-05-07T20:32:42.7323428Z torch.manual_seed(2025) 2025-05-07T20:32:42.7323502Z 2025-05-07T20:32:42.7323677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7323761Z 2025-05-07T20:32:42.7323855Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7324026Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7326011Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7326054Z 2025-05-07T20:32:42.7326171Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.7326179Z 2025-05-07T20:32:42.7326279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7326508Z self=, 2025-05-07T20:32:42.7326593Z T=128, 2025-05-07T20:32:42.7326675Z D=7168, 2025-05-07T20:32:42.7326764Z scale_ub=None, 2025-05-07T20:32:42.7326857Z contiguous=True, 2025-05-07T20:32:42.7326943Z compiled=True, 2025-05-07T20:32:42.7327021Z ) 2025-05-07T20:32:42.7327256Z self = 2025-05-07T20:32:42.7327435Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.7327443Z 2025-05-07T20:32:42.7327527Z @given( 2025-05-07T20:32:42.7327649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7327749Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7327871Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7327989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7328105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7328183Z ) 2025-05-07T20:32:42.7328446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7328539Z def test_silu_mul_quant( 2025-05-07T20:32:42.7328622Z self, 2025-05-07T20:32:42.7328700Z T: int, 2025-05-07T20:32:42.7328776Z D: int, 2025-05-07T20:32:42.7328879Z scale_ub: Optional[float], 2025-05-07T20:32:42.7328970Z contiguous: bool, 2025-05-07T20:32:42.7329058Z compiled: bool, 2025-05-07T20:32:42.7329139Z ) -> None: 2025-05-07T20:32:42.7329236Z torch.manual_seed(2025) 2025-05-07T20:32:42.7329317Z 2025-05-07T20:32:42.7329491Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7331489Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.7331501Z 2025-05-07T20:32:42.7331620Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.7331758Z =============================== warnings summary =============================== 2025-05-07T20:32:42.7332088Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.7332409Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.7332727Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.7333728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:42.7334005Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:42.7334009Z 2025-05-07T20:32:42.7334236Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:42.7334414Z ================= 1 failed, 1 deselected, 3 warnings in 24.06s ================= 2025-05-07T20:32:44.3671760Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:44.4297609Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:44.4297943Z 2025-05-07T20:32:46.4317459Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:48.5854384Z ============================= test session starts ============================== 2025-05-07T20:32:48.5855071Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:48.5855625Z cachedir: .pytest_cache 2025-05-07T20:32:48.5856237Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:48.5857009Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:48.5857437Z plugins: hypothesis-6.131.14 2025-05-07T20:32:50.2045264Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:50.4172849Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:50.4173284Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:50.4173513Z 2025-05-07T20:32:53.1083531Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1084287Z self=, 2025-05-07T20:32:53.1084748Z T=1, 2025-05-07T20:32:53.1084947Z D=5120, 2025-05-07T20:32:53.1085150Z scale_ub=None, 2025-05-07T20:32:53.1085378Z contiguous=True, 2025-05-07T20:32:53.1085625Z compiled=True, 2025-05-07T20:32:53.1085842Z ) 2025-05-07T20:32:53.1086181Z self = 2025-05-07T20:32:53.1086704Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.1086992Z 2025-05-07T20:32:53.1087074Z @given( 2025-05-07T20:32:53.1087597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1087934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1088249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1088598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1088950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1089273Z ) 2025-05-07T20:32:53.1089671Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1090154Z def test_silu_mul_quant( 2025-05-07T20:32:53.1090404Z self, 2025-05-07T20:32:53.1090607Z T: int, 2025-05-07T20:32:53.1090816Z D: int, 2025-05-07T20:32:53.1091037Z scale_ub: Optional[float], 2025-05-07T20:32:53.1091326Z contiguous: bool, 2025-05-07T20:32:53.1091578Z compiled: bool, 2025-05-07T20:32:53.1091821Z ) -> None: 2025-05-07T20:32:53.1092046Z torch.manual_seed(2025) 2025-05-07T20:32:53.1092311Z 2025-05-07T20:32:53.1092600Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1092966Z 2025-05-07T20:32:53.1093173Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1093481Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:53.1093896Z x = x_sign * x_clamp 2025-05-07T20:32:53.1094149Z x0 = x[:, :D] 2025-05-07T20:32:53.1094444Z x1 = x[:, D:] 2025-05-07T20:32:53.1094655Z 2025-05-07T20:32:53.1094848Z if contiguous: 2025-05-07T20:32:53.1095086Z x0 = x0.contiguous() 2025-05-07T20:32:53.1095348Z x1 = x1.contiguous() 2025-05-07T20:32:53.1095602Z 2025-05-07T20:32:53.1095800Z if scale_ub is not None: 2025-05-07T20:32:53.1096165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1096515Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1096844Z ) 2025-05-07T20:32:53.1097037Z else: 2025-05-07T20:32:53.1097258Z scale_ub_tensor = None 2025-05-07T20:32:53.1097519Z 2025-05-07T20:32:53.1097756Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1098079Z op = silu_mul_quant 2025-05-07T20:32:53.1098344Z if compiled: 2025-05-07T20:32:53.1098604Z op = torch.compile(op) 2025-05-07T20:32:53.1098911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1099256Z 2025-05-07T20:32:53.1099461Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.1099752Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.1100062Z 2025-05-07T20:32:53.1100308Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1100655Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.1100970Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.1101307Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.1101684Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.1102005Z 2025-05-07T20:32:53.1102208Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.1102410Z 2025-05-07T20:32:53.1102516Z moe/activation_test.py:126: 2025-05-07T20:32:53.1102816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1103172Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.1103514Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.1104366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.1105190Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.1105773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.1106510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.1107302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.1108082Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.1108892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:53.1109693Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.1110646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.1111332Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.1111974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.1112525Z fn() 2025-05-07T20:32:53.1113066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.1113691Z self.fn.run( 
2025-05-07T20:32:53.1114180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.1114742Z kernel = self.compile( 2025-05-07T20:32:53.1115367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.1116112Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.1116522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1116774Z 2025-05-07T20:32:53.1116988Z self = 2025-05-07T20:32:53.1118178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.1119757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bbfdc9d0>} 2025-05-07T20:32:53.1121282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.1122389Z context = 2025-05-07T20:32:53.1122700Z 2025-05-07T20:32:53.1122870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.1123418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.1123918Z module_map=module_map) 2025-05-07T20:32:53.1124293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.1124667Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.1124944Z E ^ 2025-05-07T20:32:53.1125439Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.1125940Z 2025-05-07T20:32:53.1126391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.1126957Z 2025-05-07T20:32:53.1127063Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1127499Z self=, 2025-05-07T20:32:53.1127920Z T=2048, 2025-05-07T20:32:53.1128115Z D=5120, 2025-05-07T20:32:53.1128311Z scale_ub=1200.0, 2025-05-07T20:32:53.1128531Z contiguous=True, 2025-05-07T20:32:53.1128767Z compiled=False, 2025-05-07T20:32:53.1128978Z ) 2025-05-07T20:32:54.6256795Z self = 2025-05-07T20:32:54.6257653Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.6257960Z 2025-05-07T20:32:54.6258041Z @given( 2025-05-07T20:32:54.6258280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6258599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6258943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6259295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6259638Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6259929Z ) 2025-05-07T20:32:54.6260289Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6260752Z def test_silu_mul_quant( 2025-05-07T20:32:54.6260998Z self, 2025-05-07T20:32:54.6261190Z T: int, 2025-05-07T20:32:54.6261384Z D: int, 2025-05-07T20:32:54.6261602Z scale_ub: Optional[float], 2025-05-07T20:32:54.6261878Z contiguous: bool, 2025-05-07T20:32:54.6262113Z compiled: bool, 2025-05-07T20:32:54.6262345Z ) -> None: 2025-05-07T20:32:54.6262564Z torch.manual_seed(2025) 2025-05-07T20:32:54.6262801Z 2025-05-07T20:32:54.6263076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6263437Z 
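# --- Aside on the CompilationError above: Triton's fp8e4nv is the NVIDIA
# FP8 E4M3 type, which Triton only lowers for compute capability >= 8.9
# (Ada/Hopper). A linux.g5.4xlarge runner carries an NVIDIA A10G (SM 8.6),
# so every FP8 quantization path in this job fails at kernel-compile time
# and, per the error text, only fp8e4b15 and fp8e5 remain available. A
# guard like the sketch below (illustrative names, not an existing helper
# in this test) would skip instead of failing:
import unittest

import torch

def gpu_supports_fp8_e4m3() -> bool:
    # FP8 E4M3 ("fp8e4nv") needs SM 8.9+; query the active device.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# @unittest.skipIf(not gpu_supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...
# --- End aside; the captured test source resumes below.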
2025-05-07T20:32:54.6263626Z x_sign = torch.sign(x) 2025-05-07T20:32:54.6264006Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.6264400Z x = x_sign * x_clamp 2025-05-07T20:32:54.6264650Z x0 = x[:, :D] 2025-05-07T20:32:54.6264861Z x1 = x[:, D:] 2025-05-07T20:32:54.6265069Z 2025-05-07T20:32:54.6265257Z if contiguous: 2025-05-07T20:32:54.6265487Z x0 = x0.contiguous() 2025-05-07T20:32:54.6265753Z x1 = x1.contiguous() 2025-05-07T20:32:54.6266089Z 2025-05-07T20:32:54.6266281Z if scale_ub is not None: 2025-05-07T20:32:54.6266563Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.6266910Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.6267222Z ) 2025-05-07T20:32:54.6267413Z else: 2025-05-07T20:32:54.6267627Z scale_ub_tensor = None 2025-05-07T20:32:54.6267874Z 2025-05-07T20:32:54.6268104Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.6268432Z op = silu_mul_quant 2025-05-07T20:32:54.6268679Z if compiled: 2025-05-07T20:32:54.6268937Z op = torch.compile(op) 2025-05-07T20:32:54.6269239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.6269519Z 2025-05-07T20:32:54.6269703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.6269995Z 2025-05-07T20:32:54.6270095Z moe/activation_test.py:117: 2025-05-07T20:32:54.6270401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.6270747Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.6271037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.6271786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.6272529Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.6273098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.6273835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.6274548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.6275114Z kernel = self.compile( 2025-05-07T20:32:54.6275687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.6276393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.6276809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.6277049Z 2025-05-07T20:32:54.6277314Z self = 2025-05-07T20:32:54.6278498Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.6280029Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1bb06cdc0>} 2025-05-07T20:32:54.6281510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.6282621Z context = 2025-05-07T20:32:54.6283097Z 2025-05-07T20:32:54.6283268Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.6283824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.6284316Z module_map=module_map) 2025-05-07T20:32:54.6284689Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.6285130Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.6285402Z E ^ 2025-05-07T20:32:54.6285970Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.6286465Z 2025-05-07T20:32:54.6286914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.6287477Z 2025-05-07T20:32:54.6287639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6288069Z self=, 2025-05-07T20:32:54.6288483Z T=2048, 2025-05-07T20:32:54.6288677Z D=5120, 2025-05-07T20:32:54.6288877Z scale_ub=1200.0, 2025-05-07T20:32:54.6289097Z contiguous=True, 2025-05-07T20:32:54.6289319Z compiled=True, 2025-05-07T20:32:54.6289550Z ) 2025-05-07T20:32:54.6289894Z self = 2025-05-07T20:32:54.6290415Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.6290709Z 2025-05-07T20:32:54.6290789Z @given( 2025-05-07T20:32:54.6291033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6291354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6291674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6300115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6300531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6300837Z ) 2025-05-07T20:32:54.6301207Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6301682Z def test_silu_mul_quant( 2025-05-07T20:32:54.6301932Z self, 2025-05-07T20:32:54.6302133Z T: int, 2025-05-07T20:32:54.6302331Z D: int, 2025-05-07T20:32:54.6302558Z scale_ub: Optional[float], 2025-05-07T20:32:54.6302844Z contiguous: bool, 2025-05-07T20:32:54.6303085Z compiled: bool, 2025-05-07T20:32:54.6303320Z ) -> None: 2025-05-07T20:32:54.6303541Z torch.manual_seed(2025) 2025-05-07T20:32:54.6303789Z 2025-05-07T20:32:54.6304070Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6304434Z 2025-05-07T20:32:54.6304622Z x_sign = torch.sign(x) 2025-05-07T20:32:54.6304926Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.6305252Z x = x_sign * x_clamp 2025-05-07T20:32:54.6305502Z x0 = x[:, :D] 2025-05-07T20:32:54.6305726Z x1 = x[:, D:] 2025-05-07T20:32:54.6305942Z 2025-05-07T20:32:54.6306122Z if contiguous: 2025-05-07T20:32:54.6306469Z x0 = x0.contiguous() 2025-05-07T20:32:54.6306742Z x1 = x1.contiguous() 2025-05-07T20:32:54.6306993Z 2025-05-07T20:32:54.6307182Z if scale_ub is not None: 2025-05-07T20:32:54.6307465Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.6307815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.6308132Z ) 2025-05-07T20:32:54.6308332Z else: 2025-05-07T20:32:54.6308545Z scale_ub_tensor = None 2025-05-07T20:32:54.6308795Z 2025-05-07T20:32:54.6309036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.6309365Z op = silu_mul_quant 2025-05-07T20:32:54.6309626Z if compiled: 
2025-05-07T20:32:54.6309982Z op = torch.compile(op) 2025-05-07T20:32:54.6310287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.6310579Z 2025-05-07T20:32:54.6310770Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.6311060Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.6311362Z 2025-05-07T20:32:54.6311602Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.6311952Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.6312250Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.6312625Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.6313040Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.6313357Z 2025-05-07T20:32:54.6313560Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.6313763Z 2025-05-07T20:32:54.6313876Z moe/activation_test.py:126: 2025-05-07T20:32:54.6314176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.6314569Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.6314904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.6315757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.6316562Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.6317141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.6317880Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.6318612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.6319388Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.6320222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:54.6321054Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.6321834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.6322520Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.6323163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.6323719Z fn() 2025-05-07T20:32:54.6324255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.6324883Z self.fn.run( 2025-05-07T20:32:54.6325374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.6325936Z kernel = self.compile( 2025-05-07T20:32:54.6326510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.6327211Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.6327672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.6327915Z 2025-05-07T20:32:54.6328128Z self = 2025-05-07T20:32:54.6329305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:54.6330819Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1baa53550>} 2025-05-07T20:32:54.6332291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.6333407Z context = 2025-05-07T20:32:54.6333715Z 2025-05-07T20:32:54.6333885Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.6334433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.6334924Z module_map=module_map) 2025-05-07T20:32:54.6335336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.6335738Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.6336013Z E ^ 2025-05-07T20:32:54.6336505Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.6336993Z 2025-05-07T20:32:54.6337439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.6338038Z 2025-05-07T20:32:54.6338139Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6338567Z self=, 2025-05-07T20:32:54.6338983Z T=16384, 2025-05-07T20:32:54.6339179Z D=7168, 2025-05-07T20:32:54.6339371Z scale_ub=1200.0, 2025-05-07T20:32:54.6339594Z contiguous=False, 2025-05-07T20:32:54.6339817Z compiled=False, 2025-05-07T20:32:54.6340023Z ) 2025-05-07T20:32:55.9595542Z self = 2025-05-07T20:32:55.9596199Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.9596703Z 2025-05-07T20:32:55.9596796Z @given( 2025-05-07T20:32:55.9597049Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.9597389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.9597710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.9598070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.9598423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.9598719Z ) 2025-05-07T20:32:55.9599106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.9599598Z def test_silu_mul_quant( 2025-05-07T20:32:55.9600022Z self, 2025-05-07T20:32:55.9600218Z T: int, 2025-05-07T20:32:55.9600417Z D: int, 2025-05-07T20:32:55.9600633Z scale_ub: Optional[float], 2025-05-07T20:32:55.9600921Z contiguous: bool, 2025-05-07T20:32:55.9601174Z compiled: bool, 2025-05-07T20:32:55.9601403Z ) -> None: 2025-05-07T20:32:55.9601622Z torch.manual_seed(2025) 2025-05-07T20:32:55.9601871Z 2025-05-07T20:32:55.9602155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.9602516Z 2025-05-07T20:32:55.9602712Z x_sign = torch.sign(x) 2025-05-07T20:32:55.9603015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.9603336Z x = x_sign * x_clamp 2025-05-07T20:32:55.9603584Z x0 = x[:, :D] 2025-05-07T20:32:55.9603804Z x1 = x[:, D:] 2025-05-07T20:32:55.9604143Z 2025-05-07T20:32:55.9604335Z if contiguous: 2025-05-07T20:32:55.9604573Z x0 = x0.contiguous() 2025-05-07T20:32:55.9604834Z x1 = x1.contiguous() 2025-05-07T20:32:55.9605083Z 2025-05-07T20:32:55.9605282Z if scale_ub is not None: 2025-05-07T20:32:55.9605559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.9605915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.9606245Z ) 2025-05-07T20:32:55.9606435Z else: 2025-05-07T20:32:55.9606645Z scale_ub_tensor = None 2025-05-07T20:32:55.9606904Z 2025-05-07T20:32:55.9607138Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:32:55.9607458Z op = silu_mul_quant 2025-05-07T20:32:55.9607718Z if compiled: 2025-05-07T20:32:55.9607971Z op = torch.compile(op) 2025-05-07T20:32:55.9608272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9608566Z 2025-05-07T20:32:55.9608759Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.9608927Z 2025-05-07T20:32:55.9609025Z moe/activation_test.py:117: 2025-05-07T20:32:55.9609332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9609682Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.9610034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9610831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.9611576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.9612141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.9612931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.9613642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.9614212Z kernel = self.compile( 2025-05-07T20:32:55.9614788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.9615482Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.9615899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9616144Z 2025-05-07T20:32:55.9616362Z self = 2025-05-07T20:32:55.9617528Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.9619042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1baa533a0>} 2025-05-07T20:32:55.9620572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.9621683Z context = 2025-05-07T20:32:55.9621992Z 2025-05-07T20:32:55.9622167Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.9622711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.9623200Z module_map=module_map) 2025-05-07T20:32:55.9623577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.9623937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.9624205Z E ^ 2025-05-07T20:32:55.9624697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.9625236Z 2025-05-07T20:32:55.9625690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.9626245Z 2025-05-07T20:32:55.9626350Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.9626781Z self=, 2025-05-07T20:32:55.9627202Z T=1, 2025-05-07T20:32:55.9627382Z D=7168, 2025-05-07T20:32:55.9627580Z scale_ub=None, 2025-05-07T20:32:55.9627792Z contiguous=True, 2025-05-07T20:32:55.9628016Z compiled=True, 2025-05-07T20:32:55.9628214Z ) 2025-05-07T20:32:55.9628538Z self = 2025-05-07T20:32:55.9629047Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.9629326Z 2025-05-07T20:32:55.9629403Z @given( 2025-05-07T20:32:55.9629633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.9630049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.9630366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.9630710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.9631044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.9631341Z ) 2025-05-07T20:32:55.9631775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.9632279Z def test_silu_mul_quant( 2025-05-07T20:32:55.9632514Z self, 2025-05-07T20:32:55.9632710Z T: int, 2025-05-07T20:32:55.9632911Z D: int, 2025-05-07T20:32:55.9633122Z scale_ub: Optional[float], 2025-05-07T20:32:55.9633400Z contiguous: bool, 2025-05-07T20:32:55.9633641Z compiled: bool, 2025-05-07T20:32:55.9633898Z ) -> None: 2025-05-07T20:32:55.9634112Z torch.manual_seed(2025) 2025-05-07T20:32:55.9634354Z 2025-05-07T20:32:55.9634622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.9634980Z 2025-05-07T20:32:55.9635172Z x_sign = torch.sign(x) 2025-05-07T20:32:55.9635461Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.9635777Z x = x_sign * x_clamp 2025-05-07T20:32:55.9636019Z x0 = x[:, :D] 2025-05-07T20:32:55.9636231Z x1 = x[:, D:] 2025-05-07T20:32:55.9636446Z 2025-05-07T20:32:55.9636629Z if contiguous: 2025-05-07T20:32:55.9636856Z x0 = x0.contiguous() 2025-05-07T20:32:55.9637118Z x1 = x1.contiguous() 2025-05-07T20:32:55.9637359Z 2025-05-07T20:32:55.9637546Z if scale_ub is not None: 2025-05-07T20:32:55.9637818Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.9638159Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.9638482Z ) 2025-05-07T20:32:55.9638665Z else: 2025-05-07T20:32:55.9638870Z scale_ub_tensor = None 2025-05-07T20:32:55.9639128Z 2025-05-07T20:32:55.9639354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.9639675Z op = silu_mul_quant 2025-05-07T20:32:55.9639933Z if compiled: 2025-05-07T20:32:55.9640174Z op = torch.compile(op) 2025-05-07T20:32:55.9640476Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9640765Z 2025-05-07T20:32:55.9640953Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.9641248Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.9641549Z 2025-05-07T20:32:55.9641786Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.9642134Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.9642435Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.9642757Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.9643125Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.9643453Z 2025-05-07T20:32:55.9643707Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:55.9643910Z 2025-05-07T20:32:55.9644009Z moe/activation_test.py:126: 2025-05-07T20:32:55.9644316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9644664Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.9644998Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.9645853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.9646668Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.9647248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.9647974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.9648717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.9649492Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.9650350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.9651190Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.9651973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.9652690Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.9653328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.9653873Z fn() 2025-05-07T20:32:55.9654454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.9655072Z self.fn.run( 2025-05-07T20:32:55.9655555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.9656118Z kernel = self.compile( 2025-05-07T20:32:55.9656690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.9657390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.9657798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9658050Z 2025-05-07T20:32:55.9658263Z self = 2025-05-07T20:32:55.9659432Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.9660950Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1ba9ff9d0>} 2025-05-07T20:32:55.9662421Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.9663539Z context = 2025-05-07T20:32:55.9663860Z 2025-05-07T20:32:55.9664030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.9664579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.9665064Z module_map=module_map) 2025-05-07T20:32:55.9665444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.9665812Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.9666077Z E ^ 2025-05-07T20:32:55.9666616Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.9667117Z 2025-05-07T20:32:55.9667565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.9668120Z 2025-05-07T20:32:55.9668234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.9668656Z self=, 2025-05-07T20:32:55.9669079Z T=4096, 2025-05-07T20:32:55.9669264Z D=5120, 2025-05-07T20:32:55.9669452Z scale_ub=None, 2025-05-07T20:32:55.9669668Z contiguous=False, 2025-05-07T20:32:55.9669995Z compiled=False, 2025-05-07T20:32:55.9670200Z ) 2025-05-07T20:32:57.7180177Z self = 2025-05-07T20:32:57.7180763Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.7181083Z 2025-05-07T20:32:57.7181190Z @given( 2025-05-07T20:32:57.7181436Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7181824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7182285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7182934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7183532Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7183846Z ) 2025-05-07T20:32:57.7184272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7184743Z def test_silu_mul_quant( 2025-05-07T20:32:57.7184996Z self, 2025-05-07T20:32:57.7185198Z T: int, 2025-05-07T20:32:57.7185394Z D: int, 2025-05-07T20:32:57.7185621Z scale_ub: Optional[float], 2025-05-07T20:32:57.7185966Z contiguous: bool, 2025-05-07T20:32:57.7186208Z compiled: bool, 2025-05-07T20:32:57.7186431Z ) -> None: 2025-05-07T20:32:57.7186645Z torch.manual_seed(2025) 2025-05-07T20:32:57.7186886Z 2025-05-07T20:32:57.7187168Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7187528Z 2025-05-07T20:32:57.7187716Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7188011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7188335Z x = x_sign * x_clamp 2025-05-07T20:32:57.7188620Z x0 = x[:, :D] 2025-05-07T20:32:57.7188840Z x1 = x[:, D:] 2025-05-07T20:32:57.7189046Z 2025-05-07T20:32:57.7189237Z if contiguous: 2025-05-07T20:32:57.7189471Z x0 = x0.contiguous() 2025-05-07T20:32:57.7189729Z x1 = x1.contiguous() 2025-05-07T20:32:57.7190063Z 2025-05-07T20:32:57.7190247Z if scale_ub is not None: 2025-05-07T20:32:57.7190524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.7190869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.7191184Z ) 2025-05-07T20:32:57.7191380Z else: 2025-05-07T20:32:57.7191589Z scale_ub_tensor = None 2025-05-07T20:32:57.7191844Z 2025-05-07T20:32:57.7192075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.7192393Z op = silu_mul_quant 2025-05-07T20:32:57.7192649Z if compiled: 
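# --- Aside: for readers skimming the repeated listings, the op under test
# computes SiLU(x0) * x1 followed by row-wise FP8 quantization. Below is a
# rough pure-float32 sketch of the reference path; the 448.0 bound is the
# FP8 E4M3 maximum, and the scaling scheme is an assumption consistent with
# the dequant step `y_fp8.to(torch.float32) * y_scale[:, None]`, not the
# exact fbgemm kernel.
import torch

FP8_E4M3_MAX = 448.0

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap, as scale_ub_tensor does above
    y_scale = row_max / FP8_E4M3_MAX                # one scale per row
    y_q = (y / y_scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return y_q, y_scale.squeeze(-1)  # casting y_q to fp8e4nv itself needs SM 8.9+
# --- End aside; the captured test source resumes below.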
2025-05-07T20:32:57.7192898Z op = torch.compile(op) 2025-05-07T20:32:57.7193196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.7193483Z 2025-05-07T20:32:57.7193674Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.7193845Z 2025-05-07T20:32:57.7193944Z moe/activation_test.py:117: 2025-05-07T20:32:57.7194244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7194591Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.7194879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.7195618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.7196435Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.7197008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.7197732Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.7198444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.7199016Z kernel = self.compile( 2025-05-07T20:32:57.7199595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.7200290Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.7200703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7200950Z 2025-05-07T20:32:57.7201171Z self = 2025-05-07T20:32:57.7202350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.7203905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba86fe50>} 2025-05-07T20:32:57.7205435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.7206533Z context = 2025-05-07T20:32:57.7206876Z 2025-05-07T20:32:57.7207054Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.7207598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.7208093Z module_map=module_map) 2025-05-07T20:32:57.7208469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.7208832Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.7209090Z E ^ 2025-05-07T20:32:57.7209581Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.7210072Z 2025-05-07T20:32:57.7210525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.7211076Z 2025-05-07T20:32:57.7211183Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7211606Z self=, 2025-05-07T20:32:57.7212032Z T=4096, 2025-05-07T20:32:57.7212223Z D=7168, 2025-05-07T20:32:57.7212408Z scale_ub=None, 2025-05-07T20:32:57.7212624Z contiguous=False, 2025-05-07T20:32:57.7212854Z compiled=False, 2025-05-07T20:32:57.7213051Z ) 2025-05-07T20:32:57.7213376Z self = 2025-05-07T20:32:57.7213896Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.7214186Z 2025-05-07T20:32:57.7214260Z @given( 2025-05-07T20:32:57.7214495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7214819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7215136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7215479Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.7215818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.7216106Z ) 2025-05-07T20:32:57.7216469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.7216943Z def test_silu_mul_quant( 2025-05-07T20:32:57.7223198Z self, 2025-05-07T20:32:57.7223436Z T: int, 2025-05-07T20:32:57.7223712Z D: int, 2025-05-07T20:32:57.7223936Z scale_ub: Optional[float], 2025-05-07T20:32:57.7224214Z contiguous: bool, 2025-05-07T20:32:57.7224453Z compiled: bool, 2025-05-07T20:32:57.7224688Z ) -> None: 2025-05-07T20:32:57.7224918Z torch.manual_seed(2025) 2025-05-07T20:32:57.7225161Z 2025-05-07T20:32:57.7225449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.7225820Z 2025-05-07T20:32:57.7226014Z x_sign = torch.sign(x) 2025-05-07T20:32:57.7226316Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.7226639Z x = x_sign * x_clamp 2025-05-07T20:32:57.7226885Z x0 = x[:, :D] 2025-05-07T20:32:57.7227106Z x1 = x[:, D:] 2025-05-07T20:32:57.7227322Z 2025-05-07T20:32:57.7227513Z if contiguous: 2025-05-07T20:32:57.7227746Z x0 = x0.contiguous() 2025-05-07T20:32:57.7228012Z x1 = x1.contiguous() 2025-05-07T20:32:57.7228260Z 2025-05-07T20:32:57.7228452Z if scale_ub is not None: 2025-05-07T20:32:57.7228732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.7229085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.7229400Z ) 2025-05-07T20:32:57.7229597Z else: 2025-05-07T20:32:57.7229985Z scale_ub_tensor = None 2025-05-07T20:32:57.7230246Z 2025-05-07T20:32:57.7230531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.7230866Z op = silu_mul_quant 2025-05-07T20:32:57.7231118Z if compiled: 2025-05-07T20:32:57.7231378Z op = torch.compile(op) 2025-05-07T20:32:57.7231692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.7231977Z 2025-05-07T20:32:57.7232216Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.7232391Z 2025-05-07T20:32:57.7232493Z moe/activation_test.py:117: 2025-05-07T20:32:57.7232803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7233147Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.7233440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.7234184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.7234931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.7235493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.7236230Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.7236937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.7237512Z kernel = self.compile( 2025-05-07T20:32:57.7238078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.7238779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.7239197Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7239440Z 2025-05-07T20:32:57.7239660Z self = 2025-05-07T20:32:57.7240833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.7242347Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba5a8a60>} 2025-05-07T20:32:57.7243820Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.7244979Z context = 2025-05-07T20:32:57.7245288Z 2025-05-07T20:32:57.7245456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.7246005Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.7246501Z module_map=module_map) 2025-05-07T20:32:57.7246880Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.7247239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.7247510Z E ^ 2025-05-07T20:32:57.7248006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.7248500Z 2025-05-07T20:32:57.7248950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.7249513Z 2025-05-07T20:32:57.7249617Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7250050Z self=, 2025-05-07T20:32:57.7250472Z T=128, 2025-05-07T20:32:57.7250654Z D=7168, 2025-05-07T20:32:57.7250844Z scale_ub=None, 2025-05-07T20:32:57.7251065Z contiguous=False, 2025-05-07T20:32:57.7251331Z compiled=True, 2025-05-07T20:32:57.7251537Z ) 2025-05-07T20:32:57.7997398Z self = 2025-05-07T20:32:57.7998038Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:57.7998336Z 2025-05-07T20:32:57.7998411Z @given( 2025-05-07T20:32:57.7998649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.7999110Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.7999552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.7999903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.8000254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.8000554Z ) 2025-05-07T20:32:57.8000921Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.8001397Z def test_silu_mul_quant( 2025-05-07T20:32:57.8001645Z self, 2025-05-07T20:32:57.8001846Z T: int, 2025-05-07T20:32:57.8002055Z D: int, 2025-05-07T20:32:57.8002276Z scale_ub: Optional[float], 2025-05-07T20:32:57.8002561Z contiguous: bool, 2025-05-07T20:32:57.8002809Z compiled: bool, 2025-05-07T20:32:57.8003045Z ) -> None: 2025-05-07T20:32:57.8003260Z torch.manual_seed(2025) 2025-05-07T20:32:57.8003514Z 2025-05-07T20:32:57.8003798Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.8004158Z 2025-05-07T20:32:57.8004356Z x_sign = torch.sign(x) 2025-05-07T20:32:57.8004657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.8004977Z x = x_sign * x_clamp 2025-05-07T20:32:57.8005229Z x0 = x[:, :D] 2025-05-07T20:32:57.8005452Z x1 = x[:, D:] 2025-05-07T20:32:57.8005662Z 2025-05-07T20:32:57.8005854Z if contiguous: 2025-05-07T20:32:57.8006092Z x0 = x0.contiguous() 2025-05-07T20:32:57.8006355Z x1 = x1.contiguous() 2025-05-07T20:32:57.8006608Z 2025-05-07T20:32:57.8006808Z if scale_ub is not None: 2025-05-07T20:32:57.8007086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.8007442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.8007771Z ) 2025-05-07T20:32:57.8007978Z else: 2025-05-07T20:32:57.8008190Z scale_ub_tensor = None 2025-05-07T20:32:57.8008454Z 2025-05-07T20:32:57.8008692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.8009023Z op = silu_mul_quant 2025-05-07T20:32:57.8009280Z if compiled: 2025-05-07T20:32:57.8009530Z op = torch.compile(op) 2025-05-07T20:32:57.8009903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8010186Z 2025-05-07T20:32:57.8010378Z y_fp8, y_scale = fn() 2025-05-07T20:32:57.8010663Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:57.8010959Z 2025-05-07T20:32:57.8011200Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.8011539Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:57.8011842Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:57.8012169Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:57.8012541Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.8012856Z 2025-05-07T20:32:57.8013055Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:57.8013261Z 2025-05-07T20:32:57.8013367Z moe/activation_test.py:126: 2025-05-07T20:32:57.8013668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8014020Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:57.8014353Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.8015201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:57.8016088Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:57.8016673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.8017449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.8018187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:57.8019005Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.8019820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:57.8020628Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.8021413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:57.8022111Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:57.8022761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:57.8023326Z fn() 2025-05-07T20:32:57.8023863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:57.8024488Z self.fn.run( 2025-05-07T20:32:57.8024978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.8025546Z kernel = self.compile( 2025-05-07T20:32:57.8026124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.8026827Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.8027242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8027493Z 2025-05-07T20:32:57.8027709Z self = 2025-05-07T20:32:57.8028894Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.8030542Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1ba5e2550>} 2025-05-07T20:32:57.8032092Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.8033202Z context = 2025-05-07T20:32:57.8033513Z 2025-05-07T20:32:57.8033682Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.8034235Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.8034732Z module_map=module_map) 2025-05-07T20:32:57.8035104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.8035468Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:57.8035741Z E ^ 2025-05-07T20:32:57.8036230Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.8036732Z 2025-05-07T20:32:57.8037184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.8037749Z 2025-05-07T20:32:57.8037854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.8038280Z self=, 2025-05-07T20:32:57.8038700Z T=128, 2025-05-07T20:32:57.8038886Z D=7168, 2025-05-07T20:32:57.8039146Z scale_ub=None, 2025-05-07T20:32:57.8039359Z contiguous=False, 2025-05-07T20:32:57.8039625Z compiled=False, 2025-05-07T20:32:57.8039834Z ) 2025-05-07T20:32:58.2018815Z self = 2025-05-07T20:32:58.2020147Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.2020757Z 2025-05-07T20:32:58.2020841Z @given( 2025-05-07T20:32:58.2021215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2021541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2021861Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2022212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2022545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2022842Z ) 2025-05-07T20:32:58.2023215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2023681Z def test_silu_mul_quant( 2025-05-07T20:32:58.2023929Z self, 2025-05-07T20:32:58.2024125Z T: int, 2025-05-07T20:32:58.2024320Z D: int, 2025-05-07T20:32:58.2024545Z scale_ub: Optional[float], 2025-05-07T20:32:58.2024828Z contiguous: bool, 2025-05-07T20:32:58.2025070Z compiled: bool, 2025-05-07T20:32:58.2025289Z ) -> None: 2025-05-07T20:32:58.2025507Z torch.manual_seed(2025) 2025-05-07T20:32:58.2025757Z 2025-05-07T20:32:58.2026032Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2026393Z 2025-05-07T20:32:58.2026594Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2026888Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2027208Z x = x_sign * x_clamp 2025-05-07T20:32:58.2027452Z x0 = x[:, :D] 2025-05-07T20:32:58.2027669Z x1 = x[:, D:] 2025-05-07T20:32:58.2027880Z 2025-05-07T20:32:58.2028059Z if contiguous: 2025-05-07T20:32:58.2028285Z x0 = x0.contiguous() 2025-05-07T20:32:58.2028549Z x1 = x1.contiguous() 2025-05-07T20:32:58.2028790Z 2025-05-07T20:32:58.2028974Z if scale_ub is not None: 2025-05-07T20:32:58.2029249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2029590Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2030089Z ) 2025-05-07T20:32:58.2030283Z else: 2025-05-07T20:32:58.2030495Z scale_ub_tensor = None 2025-05-07T20:32:58.2030753Z 2025-05-07T20:32:58.2030979Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2031302Z op = silu_mul_quant 2025-05-07T20:32:58.2031633Z if compiled: 
2025-05-07T20:32:58.2031877Z op = torch.compile(op) 2025-05-07T20:32:58.2032175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2032455Z 2025-05-07T20:32:58.2032636Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2032807Z 2025-05-07T20:32:58.2032906Z moe/activation_test.py:117: 2025-05-07T20:32:58.2033203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2033541Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2033823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2034561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2035311Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2035873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2036605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2037313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2037872Z kernel = self.compile( 2025-05-07T20:32:58.2038507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2039257Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2039672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2039912Z 2025-05-07T20:32:58.2040124Z self = 2025-05-07T20:32:58.2041294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2042849Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba5a8ee0>} 2025-05-07T20:32:58.2044318Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2045427Z context = 2025-05-07T20:32:58.2045732Z 2025-05-07T20:32:58.2045900Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2046445Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2046934Z module_map=module_map) 2025-05-07T20:32:58.2047302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2047660Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2047927Z E ^ 2025-05-07T20:32:58.2048418Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2048905Z 2025-05-07T20:32:58.2049353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2049913Z 2025-05-07T20:32:58.2050013Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2050440Z self=, 2025-05-07T20:32:58.2050863Z T=4096, 2025-05-07T20:32:58.2051040Z D=5120, 2025-05-07T20:32:58.2051230Z scale_ub=1200.0, 2025-05-07T20:32:58.2051456Z contiguous=True, 2025-05-07T20:32:58.2051672Z compiled=False, 2025-05-07T20:32:58.2051879Z ) 2025-05-07T20:32:58.2052201Z self = 2025-05-07T20:32:58.2052761Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:58.2053059Z 2025-05-07T20:32:58.2053134Z @given( 2025-05-07T20:32:58.2053362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.2053675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.2053989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.2054331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.2054669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.2054958Z ) 2025-05-07T20:32:58.2055318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.2055781Z def test_silu_mul_quant( 2025-05-07T20:32:58.2056019Z self, 2025-05-07T20:32:58.2056214Z T: int, 2025-05-07T20:32:58.2056408Z D: int, 2025-05-07T20:32:58.2056620Z scale_ub: Optional[float], 2025-05-07T20:32:58.2056893Z contiguous: bool, 2025-05-07T20:32:58.2057134Z compiled: bool, 2025-05-07T20:32:58.2057355Z ) -> None: 2025-05-07T20:32:58.2057570Z torch.manual_seed(2025) 2025-05-07T20:32:58.2057815Z 2025-05-07T20:32:58.2058081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.2058434Z 2025-05-07T20:32:58.2058624Z x_sign = torch.sign(x) 2025-05-07T20:32:58.2058958Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.2059277Z x = x_sign * x_clamp 2025-05-07T20:32:58.2059555Z x0 = x[:, :D] 2025-05-07T20:32:58.2059771Z x1 = x[:, D:] 2025-05-07T20:32:58.2059972Z 2025-05-07T20:32:58.2060154Z if contiguous: 2025-05-07T20:32:58.2060387Z x0 = x0.contiguous() 2025-05-07T20:32:58.2060638Z x1 = x1.contiguous() 2025-05-07T20:32:58.2060877Z 2025-05-07T20:32:58.2061111Z if scale_ub is not None: 2025-05-07T20:32:58.2061382Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.2061727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.2062050Z ) 2025-05-07T20:32:58.2062243Z else: 2025-05-07T20:32:58.2062456Z scale_ub_tensor = None 2025-05-07T20:32:58.2062706Z 2025-05-07T20:32:58.2062927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.2063247Z op = silu_mul_quant 2025-05-07T20:32:58.2063496Z if compiled: 2025-05-07T20:32:58.2063735Z op = torch.compile(op) 2025-05-07T20:32:58.2064038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2064318Z 2025-05-07T20:32:58.2064498Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.2064668Z 2025-05-07T20:32:58.2064765Z moe/activation_test.py:117: 2025-05-07T20:32:58.2065066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2065413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.2065693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.2066434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.2067175Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.2067734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.2068466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.2069171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.2069735Z kernel = self.compile( 2025-05-07T20:32:58.2070391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.2071092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.2071505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.2071745Z 2025-05-07T20:32:58.2072007Z self = 2025-05-07T20:32:58.2073166Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.2074667Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba1350d0>} 2025-05-07T20:32:58.2076140Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.2077248Z context = 2025-05-07T20:32:58.2077552Z 2025-05-07T20:32:58.2077718Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.2078263Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.2078749Z module_map=module_map) 2025-05-07T20:32:58.2079126Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.2079483Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.2079839Z E ^ 2025-05-07T20:32:58.2080329Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.2080854Z 2025-05-07T20:32:58.2081308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.2081862Z 2025-05-07T20:32:58.2081962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.2082426Z self=, 2025-05-07T20:32:58.2083009Z T=1, 2025-05-07T20:32:58.2083185Z D=5120, 2025-05-07T20:32:58.2083378Z scale_ub=None, 2025-05-07T20:32:58.2083588Z contiguous=True, 2025-05-07T20:32:58.2083803Z compiled=True, 2025-05-07T20:32:58.2084001Z ) 2025-05-07T20:32:58.8597294Z self = 2025-05-07T20:32:58.8597851Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.8598214Z 2025-05-07T20:32:58.8598339Z @given( 2025-05-07T20:32:58.8598695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8599149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8599592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8600031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8600470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8600855Z ) 2025-05-07T20:32:58.8601230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8601700Z def test_silu_mul_quant( 2025-05-07T20:32:58.8601947Z self, 2025-05-07T20:32:58.8602139Z T: int, 2025-05-07T20:32:58.8602341Z D: int, 2025-05-07T20:32:58.8602563Z scale_ub: Optional[float], 2025-05-07T20:32:58.8602835Z contiguous: bool, 2025-05-07T20:32:58.8603081Z compiled: bool, 2025-05-07T20:32:58.8603314Z ) -> None: 2025-05-07T20:32:58.8603531Z torch.manual_seed(2025) 2025-05-07T20:32:58.8603782Z 2025-05-07T20:32:58.8604057Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8604410Z 2025-05-07T20:32:58.8604614Z x_sign = torch.sign(x) 2025-05-07T20:32:58.8604935Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.8605281Z x = x_sign * x_clamp 2025-05-07T20:32:58.8605538Z x0 = x[:, :D] 2025-05-07T20:32:58.8605766Z x1 = x[:, D:] 2025-05-07T20:32:58.8605987Z 2025-05-07T20:32:58.8606180Z if contiguous: 2025-05-07T20:32:58.8606428Z x0 = x0.contiguous() 2025-05-07T20:32:58.8606863Z x1 = x1.contiguous() 2025-05-07T20:32:58.8607105Z 2025-05-07T20:32:58.8607297Z if scale_ub is not None: 2025-05-07T20:32:58.8613098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.8613495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.8613830Z ) 2025-05-07T20:32:58.8614028Z else: 2025-05-07T20:32:58.8614248Z scale_ub_tensor = None 2025-05-07T20:32:58.8614513Z 2025-05-07T20:32:58.8614753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8615100Z op = silu_mul_quant 2025-05-07T20:32:58.8615364Z if compiled: 2025-05-07T20:32:58.8615639Z op = torch.compile(op) 2025-05-07T20:32:58.8615947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8616235Z 2025-05-07T20:32:58.8616432Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.8616718Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.8617027Z 2025-05-07T20:32:58.8617266Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8617612Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.8617924Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.8618352Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.8618726Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8619107Z 2025-05-07T20:32:58.8619311Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:58.8619513Z 2025-05-07T20:32:58.8619622Z moe/activation_test.py:126: 2025-05-07T20:32:58.8619923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8620271Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.8620668Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8621521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.8622335Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.8622912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.8623653Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8624389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.8625180Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8625987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:58.8626796Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8627576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.8628256Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.8628896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.8629440Z fn() 2025-05-07T20:32:58.8630168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.8630844Z self.fn.run( 2025-05-07T20:32:58.8631329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.8631890Z kernel = self.compile( 2025-05-07T20:32:58.8632453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.8633155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8633619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8633862Z 2025-05-07T20:32:58.8634074Z self = 2025-05-07T20:32:58.8635245Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.8636756Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1b9dfb4c0>}
2025-05-07T20:32:58.8638228Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:58.8639343Z context = 
2025-05-07T20:32:58.8639821Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:58.8640368Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:58.8640906Z                            module_map=module_map)
2025-05-07T20:32:58.8641324Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.8641686Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.8641995Z E       ^
2025-05-07T20:32:58.8642485Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.8643423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.8644125Z Trying example: test_silu_mul_quant( self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True )
[test source identical to the examples above; fn() succeeds, ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:32:59.4796476Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True )
[test source identical to the examples above; fn() succeeds, ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row]
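Every failing example above stops at the same point: Triton cannot lower the fp8e4nv (float8_e4m3fn) dtype on this runner's GPU, whose backend only offers fp8e4b15 and fp8e5. That supported-dtype list is what Triton reports on pre-sm_89 parts, so the usual fix is to gate fp8 tests on device capability rather than let Hypothesis replay the same compile failure for every drawn example. A minimal sketch, assuming unittest-style tests; the helper, the class name, and the sm_89 threshold are illustrative assumptions, not FBGEMM's actual guard:

# Sketch: skip fp8e4nv tests on GPUs that cannot compile them.
# Assumption: Triton's fp8e4nv lowering needs compute capability >= 8.9
# (Ada/Hopper); older parts such as sm_86 only get fp8e5/fp8e4b15.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Pure capability query; no kernel launch required.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8ActivationTests(unittest.TestCase):  # hypothetical test class
    ...

With a guard like this the job would report skips instead of spending minutes re-raising the identical CompilationError once per Hypothesis example.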
2025-05-07T20:33:00.4558095Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True )
[test source identical to the examples above; fn() succeeds, ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:33:01.2983106Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True )
2025-05-07T20:33:01.3414956Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:01.3422287Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:01.3423859Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:01.3424956Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:01.3426170Z W0507 20:33:01.339931 88025 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
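The recompile warning above is separate from the fp8 failure: every drawn (T, D, contiguous) combination changes the shape or strides of x0, so torch.compile installs new guards until it hits recompile_limit (8) and silently falls back to eager for silu_mul_quant. A hedged sketch of the standard remedies; the limit value is illustrative, and which option FBGEMM would prefer is not established by this log:

# Sketch: three common ways to stop a guard-driven recompile storm.
import torch
import torch._dynamo

# 1. Raise the limit named in the warning (default is 8).
torch._dynamo.config.recompile_limit = 32

# 2. Compile with dynamic shapes so the batch dimension T stays symbolic.
#    op = torch.compile(silu_mul_quant, dynamic=True)

# 3. Mark only the varying dimension dynamic on the concrete inputs.
#    torch._dynamo.mark_dynamic(x0, 0)
#    torch._dynamo.mark_dynamic(x1, 0)

In a test that sweeps sizes, option 2 or 3 is usually preferable, since raising the limit only postpones the eager fallback.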
2025-05-07T20:33:01.4634024Z self = 
2025-05-07T20:33:01.4634687Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test source identical to the examples above; fn() succeeds, ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:33:01.4674801Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True )
2025-05-07T20:33:01.6380583Z self = 
2025-05-07T20:33:01.6381496Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[test source identical to the examples above; this time the compiled fn() itself fails]
2025-05-07T20:33:01.6399233Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:01.6399511Z moe/activation_test.py:117: 
2025-05-07T20:33:01.6399824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.6400170Z moe/activation_test.py:115: in fn
2025-05-07T20:33:01.6400466Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.6401066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:01.6401657Z     return fn(*args, **kwargs)
2025-05-07T20:33:01.6402364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.6403105Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.6403673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.6404397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.6405104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.6405671Z kernel = self.compile( 2025-05-07T20:33:01.6406243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.6406937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.6407349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6407588Z 2025-05-07T20:33:01.6407855Z self = 2025-05-07T20:33:01.6409033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.6410538Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b90549d0>} 2025-05-07T20:33:01.6412015Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.6413123Z context = 2025-05-07T20:33:01.6413428Z 2025-05-07T20:33:01.6413607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.6414155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.6414646Z module_map=module_map) 2025-05-07T20:33:01.6415023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.6415429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.6415692Z E ^ 2025-05-07T20:33:01.6416224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.6416714Z 2025-05-07T20:33:01.6417169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.6417725Z 2025-05-07T20:33:01.6417871Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.6418299Z self=, 2025-05-07T20:33:01.6418723Z T=1, 2025-05-07T20:33:01.6418907Z D=5120, 2025-05-07T20:33:01.6419097Z scale_ub=None, 2025-05-07T20:33:01.6419325Z contiguous=False, 2025-05-07T20:33:01.6419558Z compiled=True, 2025-05-07T20:33:01.6419762Z ) 2025-05-07T20:33:01.7227011Z self = 2025-05-07T20:33:01.7227873Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.7228267Z 2025-05-07T20:33:01.7228388Z @given( 2025-05-07T20:33:01.7228629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.7228956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.7229272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.7229608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.7230016Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.7230319Z ) 2025-05-07T20:33:01.7230671Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.7231146Z def test_silu_mul_quant( 2025-05-07T20:33:01.7231392Z self, 2025-05-07T20:33:01.7231576Z T: int, 2025-05-07T20:33:01.7231774Z D: int, 2025-05-07T20:33:01.7231991Z scale_ub: Optional[float], 2025-05-07T20:33:01.7232259Z contiguous: bool, 2025-05-07T20:33:01.7232496Z compiled: bool, 2025-05-07T20:33:01.7232723Z ) -> None: 2025-05-07T20:33:01.7232928Z torch.manual_seed(2025) 2025-05-07T20:33:01.7233172Z 2025-05-07T20:33:01.7233443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.7233789Z 2025-05-07T20:33:01.7233974Z x_sign = torch.sign(x) 2025-05-07T20:33:01.7234270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.7234590Z x = x_sign * x_clamp 2025-05-07T20:33:01.7234833Z x0 = x[:, :D] 2025-05-07T20:33:01.7235054Z x1 = x[:, D:] 2025-05-07T20:33:01.7235258Z 2025-05-07T20:33:01.7235435Z if contiguous: 2025-05-07T20:33:01.7235778Z x0 = x0.contiguous() 2025-05-07T20:33:01.7236041Z x1 = x1.contiguous() 2025-05-07T20:33:01.7236278Z 2025-05-07T20:33:01.7236471Z if scale_ub is not None: 2025-05-07T20:33:01.7236748Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.7237088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.7237407Z ) 2025-05-07T20:33:01.7237596Z else: 2025-05-07T20:33:01.7237798Z scale_ub_tensor = None 2025-05-07T20:33:01.7238053Z 2025-05-07T20:33:01.7238284Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.7238602Z op = silu_mul_quant 2025-05-07T20:33:01.7238856Z if compiled: 2025-05-07T20:33:01.7239103Z op = torch.compile(op) 2025-05-07T20:33:01.7239409Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.7239686Z 2025-05-07T20:33:01.7239882Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.7240177Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.7240467Z 2025-05-07T20:33:01.7240706Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.7241053Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.7241348Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.7241734Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.7242109Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.7242480Z 2025-05-07T20:33:01.7242682Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.7242887Z 2025-05-07T20:33:01.7242984Z moe/activation_test.py:126: 2025-05-07T20:33:01.7243288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.7243691Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.7244024Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.7244875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.7245683Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.7246261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.7246995Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.7247733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.7248499Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.7249301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:01.7250106Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.7250892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.7251572Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.7252217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.7252771Z fn() 2025-05-07T20:33:01.7253305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.7253927Z self.fn.run( 2025-05-07T20:33:01.7254419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.7254986Z kernel = self.compile( 2025-05-07T20:33:01.7255550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.7256247Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.7256710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.7256955Z 2025-05-07T20:33:01.7257176Z self = 2025-05-07T20:33:01.7258346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.7259864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
2025-05-07T20:33:01.7261384Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:01.7262505Z context = 
2025-05-07T20:33:01.7262814Z 
2025-05-07T20:33:01.7262980Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.7263527Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.7264015Z             module_map=module_map)
2025-05-07T20:33:01.7264465Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.7264872Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:01.7265150Z E       ^
2025-05-07T20:33:01.7265640Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.7266122Z 
2025-05-07T20:33:01.7266567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:01.7267173Z 

[Hypothesis then retried the remaining sampled parameter combinations (log timestamps 20:33:01.72 through 20:33:03.26); each retry reprints the identical source listing and traceback and fails the same way, so only the per-example parameters and the kernel that fails to compile are kept here:

Trying example: T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=128,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=True   -> _kernel_quantize_fp8_row (in ref_fn)
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant
Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant

Every retry ends with:
E   triton.compiler.errors.CompilationError: at 1:0:
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
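Triage note for readers of this log: fp8e4nv is Triton's name for torch.float8_e4m3fn, and in the Triton build used here its codegen is only available on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job's runner type, linux.g5.4xlarge.nvidia.gpu, carries an NVIDIA A10G (SM 8.6), where Triton exposes only fp8e4b15 and fp8e5; any kernel that touches the fp8e4nv dtype therefore fails in ast_to_ttir before launch, independent of T, D, scale_ub, contiguous, or compiled. A minimal sketch of a capability gate that would skip these examples on pre-SM-8.9 runners follows; the helper name and the skip wiring are illustrative, not FBGEMM's actual test API:

import unittest

import torch


def sm89_or_newer() -> bool:
    # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on SM 8.9+ (Ada/Hopper);
    # the A10G on linux.g5.4xlarge reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(sm89_or_newer(), "fp8e4nv requires compute capability >= 8.9")
class Fp8MoeActivationTest(unittest.TestCase):
    def test_capability_gate(self) -> None:
        # Placeholder body: on a pre-SM-8.9 runner this class is skipped wholesale.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))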
y_scale_ref = ref_fn() 2025-05-07T20:33:02.9294256Z 2025-05-07T20:33:02.9294358Z moe/activation_test.py:126: 2025-05-07T20:33:02.9294664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.9295015Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:02.9295351Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:02.9296204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:02.9297013Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:02.9297591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.9298320Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.9299059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:02.9299825Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:02.9300631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:02.9301446Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:02.9302332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:02.9303014Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:02.9303657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:02.9304207Z fn() 2025-05-07T20:33:02.9304737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:02.9305363Z self.fn.run( 2025-05-07T20:33:02.9305859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.9306424Z kernel = self.compile( 2025-05-07T20:33:02.9306993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.9307699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.9308121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.9308364Z 2025-05-07T20:33:02.9308583Z self = 2025-05-07T20:33:02.9309935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.9311487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1b8a4b160>} 2025-05-07T20:33:02.9313006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.9314161Z context = 2025-05-07T20:33:02.9314463Z 2025-05-07T20:33:02.9314633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.9315183Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.9315671Z module_map=module_map) 2025-05-07T20:33:02.9316051Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.9316415Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:02.9316696Z E ^ 2025-05-07T20:33:02.9317182Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.9317669Z 2025-05-07T20:33:02.9318115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.9318678Z 2025-05-07T20:33:02.9318778Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.9319200Z self=, 2025-05-07T20:33:02.9319616Z T=1, 2025-05-07T20:33:02.9319789Z D=5120, 2025-05-07T20:33:02.9319979Z scale_ub=1200.0, 2025-05-07T20:33:02.9320198Z contiguous=False, 2025-05-07T20:33:02.9320418Z compiled=True, 2025-05-07T20:33:02.9320620Z ) 2025-05-07T20:33:03.1303575Z self = 2025-05-07T20:33:03.1304388Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.1304791Z 2025-05-07T20:33:03.1304898Z @given( 2025-05-07T20:33:03.1305211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.1305597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.1305912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.1306262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.1306595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.1306887Z ) 2025-05-07T20:33:03.1307399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.1308050Z def test_silu_mul_quant( 2025-05-07T20:33:03.1308290Z self, 2025-05-07T20:33:03.1308480Z T: int, 2025-05-07T20:33:03.1308674Z D: int, 2025-05-07T20:33:03.1308885Z scale_ub: Optional[float], 2025-05-07T20:33:03.1309156Z contiguous: bool, 2025-05-07T20:33:03.1309396Z compiled: bool, 2025-05-07T20:33:03.1309617Z ) -> None: 2025-05-07T20:33:03.1309880Z torch.manual_seed(2025) 2025-05-07T20:33:03.1310129Z 2025-05-07T20:33:03.1310397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.1310754Z 2025-05-07T20:33:03.1310947Z x_sign = torch.sign(x) 2025-05-07T20:33:03.1311233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.1311575Z x = x_sign * x_clamp 2025-05-07T20:33:03.1311838Z x0 = x[:, :D] 2025-05-07T20:33:03.1312045Z x1 = x[:, D:] 2025-05-07T20:33:03.1312254Z 2025-05-07T20:33:03.1312437Z if contiguous: 2025-05-07T20:33:03.1312664Z x0 = x0.contiguous() 2025-05-07T20:33:03.1312924Z x1 = x1.contiguous() 2025-05-07T20:33:03.1313164Z 2025-05-07T20:33:03.1313353Z if scale_ub is not None: 2025-05-07T20:33:03.1313694Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.1314070Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.1314475Z ) 2025-05-07T20:33:03.1314668Z else: 2025-05-07T20:33:03.1314891Z scale_ub_tensor = None 2025-05-07T20:33:03.1315162Z 2025-05-07T20:33:03.1315402Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.1315752Z op = silu_mul_quant 2025-05-07T20:33:03.1316086Z if compiled: 
2025-05-07T20:33:03.1316346Z op = torch.compile(op) 2025-05-07T20:33:03.1316675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.1316982Z 2025-05-07T20:33:03.1317180Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.1317367Z 2025-05-07T20:33:03.1317469Z moe/activation_test.py:117: 2025-05-07T20:33:03.1317797Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.1318176Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.1318481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.1319145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.1319826Z return fn(*args, **kwargs) 2025-05-07T20:33:03.1320617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.1321457Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.1322108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.1322985Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.1323779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.1324416Z kernel = self.compile( 2025-05-07T20:33:03.1325062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.1325846Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.1326312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.1326585Z 2025-05-07T20:33:03.1326825Z self = 2025-05-07T20:33:03.1328176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.1329994Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8a4bb80>} 2025-05-07T20:33:03.1331693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.1332971Z context = 2025-05-07T20:33:03.1333320Z 2025-05-07T20:33:03.1333504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.1334124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.1334673Z module_map=module_map) 2025-05-07T20:33:03.1335052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.1335411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.1335665Z E ^ 2025-05-07T20:33:03.1336158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.1336647Z 2025-05-07T20:33:03.1337091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.1337687Z 2025-05-07T20:33:03.1337795Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.1338252Z self=, 2025-05-07T20:33:03.1338669Z T=1, 2025-05-07T20:33:03.1338848Z D=5120, 2025-05-07T20:33:03.1339033Z scale_ub=1200.0, 2025-05-07T20:33:03.1339253Z contiguous=False, 2025-05-07T20:33:03.1339480Z compiled=False, 2025-05-07T20:33:03.1339743Z ) 2025-05-07T20:33:03.1340064Z self = 2025-05-07T20:33:03.1340574Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:03.1340856Z 2025-05-07T20:33:03.1340936Z @given( 2025-05-07T20:33:03.1341159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.1341472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.1341782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.1342115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.1342445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.1342739Z ) 2025-05-07T20:33:03.1343092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.1343559Z def test_silu_mul_quant( 2025-05-07T20:33:03.1343799Z self, 2025-05-07T20:33:03.1343983Z T: int, 2025-05-07T20:33:03.1344178Z D: int, 2025-05-07T20:33:03.1344402Z scale_ub: Optional[float], 2025-05-07T20:33:03.1344674Z contiguous: bool, 2025-05-07T20:33:03.1344908Z compiled: bool, 2025-05-07T20:33:03.1345127Z ) -> None: 2025-05-07T20:33:03.1345341Z torch.manual_seed(2025) 2025-05-07T20:33:03.1345576Z 2025-05-07T20:33:03.1345849Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.1346203Z 2025-05-07T20:33:03.1346386Z x_sign = torch.sign(x) 2025-05-07T20:33:03.1346676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.1346991Z x = x_sign * x_clamp 2025-05-07T20:33:03.1347229Z x0 = x[:, :D] 2025-05-07T20:33:03.1347439Z x1 = x[:, D:] 2025-05-07T20:33:03.1347646Z 2025-05-07T20:33:03.1347819Z if contiguous: 2025-05-07T20:33:03.1348046Z x0 = x0.contiguous() 2025-05-07T20:33:03.1348305Z x1 = x1.contiguous() 2025-05-07T20:33:03.1348539Z 2025-05-07T20:33:03.1348731Z if scale_ub is not None: 2025-05-07T20:33:03.1349006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.1349339Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.1349654Z ) 2025-05-07T20:33:03.1349983Z else: 2025-05-07T20:33:03.1350190Z scale_ub_tensor = None 2025-05-07T20:33:03.1350437Z 2025-05-07T20:33:03.1350661Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.1350979Z op = silu_mul_quant 2025-05-07T20:33:03.1351222Z if compiled: 2025-05-07T20:33:03.1351471Z op = torch.compile(op) 2025-05-07T20:33:03.1351777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.1352054Z 2025-05-07T20:33:03.1352243Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.1352409Z 2025-05-07T20:33:03.1352511Z moe/activation_test.py:117: 2025-05-07T20:33:03.1352828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.1353165Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.1353453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.1354194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.1354931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.1355491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.1356264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.1356965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.1357561Z kernel = self.compile( 2025-05-07T20:33:03.1358130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.1358829Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.1359279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.1359518Z 2025-05-07T20:33:03.1359732Z self = 2025-05-07T20:33:03.1360898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.1362405Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8b57550>} 2025-05-07T20:33:03.1363877Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.1364974Z context = 2025-05-07T20:33:03.1365289Z 2025-05-07T20:33:03.1365459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.1372244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.1372755Z module_map=module_map) 2025-05-07T20:33:03.1373134Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.1373493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.1373760Z E ^ 2025-05-07T20:33:03.1374263Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.1374757Z 
2025-05-07T20:33:03.1375209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:03.1375772Z 
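Every failure above has the same root cause: the _fbgemm_silu_mul_quant Triton kernel requests the fp8e4nv element type (PyTorch's torch.float8_e4m3fn), which Triton's NVIDIA backend can generally only lower on GPUs of compute capability 8.9 and newer; on this machine's GPU the backend offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. Below is a minimal sketch of a capability gate that would skip such tests on unsupported hardware instead of failing them; the helper, the suite name, and the 8.9 threshold are illustrative assumptions, not FBGEMM's actual test scaffolding.

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8_e4m3fn) lowering needs NVIDIA
        # compute capability >= 8.9; older GPUs expose only fp8e4b15/fp8e5,
        # matching the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(
        _supports_fp8e4nv(),
        "Triton fp8e4nv is unavailable on this GPU (only fp8e4b15/fp8e5)",
    )
    class SiluMulQuantTests(unittest.TestCase):  # hypothetical suite name
        ...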
2025-05-07T20:33:03.2567503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.2568245Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.2568817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.2569562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.2570347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.2570921Z kernel = self.compile( 2025-05-07T20:33:03.2571499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.2572203Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.2572611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2572864Z 2025-05-07T20:33:03.2573081Z self = 2025-05-07T20:33:03.2574255Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.2575777Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b84dd1f0>} 2025-05-07T20:33:03.2577255Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.2578401Z context = 2025-05-07T20:33:03.2578711Z 2025-05-07T20:33:03.2578879Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.2579465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.2579957Z module_map=module_map) 2025-05-07T20:33:03.2580329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.2580736Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.2581008Z E ^ 2025-05-07T20:33:03.2581502Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.2582044Z 2025-05-07T20:33:03.2582498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.2583302Z 2025-05-07T20:33:03.2583407Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.2583842Z self=, 2025-05-07T20:33:03.2584265Z T=2048, 2025-05-07T20:33:03.2584459Z D=7168, 2025-05-07T20:33:03.2584649Z scale_ub=1200.0, 2025-05-07T20:33:03.2584874Z contiguous=False, 2025-05-07T20:33:03.2585099Z compiled=True, 2025-05-07T20:33:03.2585302Z ) 2025-05-07T20:33:03.2585627Z self = 2025-05-07T20:33:03.2586154Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.2586443Z 2025-05-07T20:33:03.2586530Z @given( 2025-05-07T20:33:03.2586766Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.2587092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.2587413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.2587754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.2588081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.2588372Z ) 2025-05-07T20:33:03.2588729Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.2589188Z def test_silu_mul_quant( 2025-05-07T20:33:03.2589434Z self, 2025-05-07T20:33:03.2589631Z T: int, 2025-05-07T20:33:03.2589972Z D: int, 2025-05-07T20:33:03.2590198Z scale_ub: Optional[float], 2025-05-07T20:33:03.2590479Z contiguous: bool, 2025-05-07T20:33:03.2590723Z compiled: bool, 2025-05-07T20:33:03.2590951Z ) -> None: 2025-05-07T20:33:03.2591169Z torch.manual_seed(2025) 2025-05-07T20:33:03.2591416Z 2025-05-07T20:33:03.2591767Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.2592127Z 2025-05-07T20:33:03.2592316Z x_sign = torch.sign(x) 2025-05-07T20:33:03.2592613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.2592928Z x = x_sign * x_clamp 2025-05-07T20:33:03.2593176Z x0 = x[:, :D] 2025-05-07T20:33:03.2593391Z x1 = x[:, D:] 2025-05-07T20:33:03.2593606Z 2025-05-07T20:33:03.2593790Z if contiguous: 2025-05-07T20:33:03.2594020Z x0 = x0.contiguous() 2025-05-07T20:33:03.2594284Z x1 = x1.contiguous() 2025-05-07T20:33:03.2594535Z 2025-05-07T20:33:03.2594724Z if scale_ub is not None: 2025-05-07T20:33:03.2594997Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.2595337Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.2595654Z ) 2025-05-07T20:33:03.2595841Z else: 2025-05-07T20:33:03.2596052Z scale_ub_tensor = None 2025-05-07T20:33:03.2596308Z 2025-05-07T20:33:03.2596547Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.2596881Z op = silu_mul_quant 2025-05-07T20:33:03.2597139Z if compiled: 2025-05-07T20:33:03.2597387Z op = torch.compile(op) 2025-05-07T20:33:03.2597693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2598048Z 2025-05-07T20:33:03.2598233Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.2598455Z 2025-05-07T20:33:03.2598555Z moe/activation_test.py:117: 2025-05-07T20:33:03.2598862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2599206Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.2599493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2600144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.2600742Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.2601447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.2602195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.2602769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.2603654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.2604367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.2604933Z kernel = self.compile( 2025-05-07T20:33:03.2605500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.2606196Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.2606608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2606847Z 2025-05-07T20:33:03.2607063Z self = 2025-05-07T20:33:03.2608372Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.2609878Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b84ddee0>} 2025-05-07T20:33:03.2611354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.2612518Z context = 2025-05-07T20:33:03.2612822Z 2025-05-07T20:33:03.2612995Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.2613597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.2614094Z module_map=module_map) 2025-05-07T20:33:03.2614470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.2614829Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.2615085Z E ^ 2025-05-07T20:33:03.2615579Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.2616068Z 2025-05-07T20:33:03.2616519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.2617074Z 2025-05-07T20:33:03.5271885Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5273061Z self=, 2025-05-07T20:33:03.5273800Z T=1, 2025-05-07T20:33:03.5274111Z D=5120, 2025-05-07T20:33:03.5274436Z scale_ub=None, 2025-05-07T20:33:03.5274793Z contiguous=False, 2025-05-07T20:33:03.5275159Z compiled=False, 2025-05-07T20:33:03.5275496Z ) 2025-05-07T20:33:03.5276018Z self = 2025-05-07T20:33:03.5277190Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.5277619Z 2025-05-07T20:33:03.5277874Z @given( 2025-05-07T20:33:03.5278212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5278720Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5279263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5279827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5280515Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5280965Z ) 2025-05-07T20:33:03.5281602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5282399Z def test_silu_mul_quant( 2025-05-07T20:33:03.5283120Z self, 2025-05-07T20:33:03.5283447Z T: int, 2025-05-07T20:33:03.5283767Z D: int, 2025-05-07T20:33:03.5284112Z scale_ub: Optional[float], 2025-05-07T20:33:03.5284558Z contiguous: bool, 2025-05-07T20:33:03.5284946Z compiled: bool, 2025-05-07T20:33:03.5285304Z ) -> None: 2025-05-07T20:33:03.5285659Z torch.manual_seed(2025) 2025-05-07T20:33:03.5286074Z 2025-05-07T20:33:03.5286522Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5287113Z 2025-05-07T20:33:03.5287433Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5287919Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5288426Z x = x_sign * x_clamp 2025-05-07T20:33:03.5288834Z x0 = x[:, :D] 2025-05-07T20:33:03.5289194Z x1 = x[:, D:] 2025-05-07T20:33:03.5289533Z 2025-05-07T20:33:03.5289841Z if contiguous: 2025-05-07T20:33:03.5290227Z x0 = x0.contiguous() 2025-05-07T20:33:03.5290655Z x1 = x1.contiguous() 2025-05-07T20:33:03.5291058Z 2025-05-07T20:33:03.5291375Z if scale_ub is not None: 2025-05-07T20:33:03.5291823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5292388Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5292911Z ) 2025-05-07T20:33:03.5293223Z else: 2025-05-07T20:33:03.5293582Z scale_ub_tensor = None 2025-05-07T20:33:03.5294025Z 2025-05-07T20:33:03.5294399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5294940Z op = silu_mul_quant 2025-05-07T20:33:03.5295352Z if compiled: 2025-05-07T20:33:03.5295754Z op = torch.compile(op) 2025-05-07T20:33:03.5296243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5296708Z 2025-05-07T20:33:03.5297022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5297293Z 2025-05-07T20:33:03.5297606Z moe/activation_test.py:117: 2025-05-07T20:33:03.5298115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5298675Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5299130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5300345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5301576Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5302545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5303710Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5306364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5307271Z kernel = self.compile( 2025-05-07T20:33:03.5308170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5309253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5310025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5310435Z 2025-05-07T20:33:03.5310897Z self = 2025-05-07T20:33:03.5312797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5315224Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8ad85e0>} 2025-05-07T20:33:03.5317712Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5319519Z context = 2025-05-07T20:33:03.5319999Z 2025-05-07T20:33:03.5320284Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5321170Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5321985Z module_map=module_map) 2025-05-07T20:33:03.5322597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5323177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5323617Z E ^ 2025-05-07T20:33:03.5324418Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5325224Z 2025-05-07T20:33:03.5325971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5326859Z 2025-05-07T20:33:03.5327027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5327729Z self=, 2025-05-07T20:33:03.5328417Z T=4096, 2025-05-07T20:33:03.5328716Z D=7168, 2025-05-07T20:33:03.5329034Z scale_ub=1200.0, 2025-05-07T20:33:03.5329398Z contiguous=False, 2025-05-07T20:33:03.5329760Z compiled=False, 2025-05-07T20:33:03.5330101Z ) 2025-05-07T20:33:03.5330631Z self = 2025-05-07T20:33:03.5331481Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:03.5331988Z 2025-05-07T20:33:03.5332135Z @given( 2025-05-07T20:33:03.5332513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5333024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5333517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5334163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5334726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5335197Z ) 2025-05-07T20:33:03.5335786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5336552Z def test_silu_mul_quant( 2025-05-07T20:33:03.5336959Z self, 2025-05-07T20:33:03.5337270Z T: int, 2025-05-07T20:33:03.5337595Z D: int, 2025-05-07T20:33:03.5337950Z scale_ub: Optional[float], 2025-05-07T20:33:03.5338387Z contiguous: bool, 2025-05-07T20:33:03.5338780Z compiled: bool, 2025-05-07T20:33:03.5339133Z ) -> None: 2025-05-07T20:33:03.5339472Z torch.manual_seed(2025) 2025-05-07T20:33:03.5339870Z 2025-05-07T20:33:03.5340310Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5340887Z 2025-05-07T20:33:03.5341198Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5341676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5342184Z x = x_sign * x_clamp 2025-05-07T20:33:03.5342581Z x0 = x[:, :D] 2025-05-07T20:33:03.5342939Z x1 = x[:, D:] 2025-05-07T20:33:03.5343272Z 2025-05-07T20:33:03.5343572Z if contiguous: 2025-05-07T20:33:03.5344074Z x0 = x0.contiguous() 2025-05-07T20:33:03.5344494Z x1 = x1.contiguous() 2025-05-07T20:33:03.5344950Z 2025-05-07T20:33:03.5345259Z if scale_ub is not None: 2025-05-07T20:33:03.5345709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5346254Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5346752Z ) 2025-05-07T20:33:03.5347060Z else: 2025-05-07T20:33:03.5348087Z scale_ub_tensor = None 2025-05-07T20:33:03.5348504Z 2025-05-07T20:33:03.5348875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5349392Z op = silu_mul_quant 2025-05-07T20:33:03.5349974Z if compiled: 2025-05-07T20:33:03.5350388Z op = torch.compile(op) 2025-05-07T20:33:03.5350851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5351288Z 2025-05-07T20:33:03.5351579Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5351826Z 2025-05-07T20:33:03.5351979Z moe/activation_test.py:117: 2025-05-07T20:33:03.5352449Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5353005Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5353471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5354636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5355868Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5356797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5357987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5359146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5360083Z kernel = self.compile( 2025-05-07T20:33:03.5361025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5362158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5362835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5363222Z 2025-05-07T20:33:03.5363564Z self = 2025-05-07T20:33:03.5365491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5367923Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8c291f0>} 2025-05-07T20:33:03.5370101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5371902Z context = 2025-05-07T20:33:03.5372399Z 2025-05-07T20:33:03.5372683Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5373579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5374388Z module_map=module_map) 2025-05-07T20:33:03.5374997Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5375590Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5376012Z E ^ 2025-05-07T20:33:03.5376811Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5377611Z 2025-05-07T20:33:03.5378418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5379326Z 2025-05-07T20:33:03.5379562Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5380244Z self=, 2025-05-07T20:33:03.5380927Z T=16384, 2025-05-07T20:33:03.5381242Z D=7168, 2025-05-07T20:33:03.5381549Z scale_ub=None, 2025-05-07T20:33:03.5381901Z contiguous=True, 2025-05-07T20:33:03.5382404Z compiled=True, 2025-05-07T20:33:03.5383007Z ) 2025-05-07T20:33:03.8233629Z self = 2025-05-07T20:33:03.8234597Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.8235096Z 2025-05-07T20:33:03.8235225Z @given( 2025-05-07T20:33:03.8235612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.8236117Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.8236597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.8237106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.8237599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.8238056Z ) 2025-05-07T20:33:03.8238645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.8239402Z def test_silu_mul_quant( 2025-05-07T20:33:03.8239791Z self, 2025-05-07T20:33:03.8240108Z T: int, 2025-05-07T20:33:03.8240437Z D: int, 2025-05-07T20:33:03.8240779Z scale_ub: Optional[float], 2025-05-07T20:33:03.8241232Z contiguous: bool, 2025-05-07T20:33:03.8241642Z compiled: bool, 2025-05-07T20:33:03.8242048Z ) -> None: 2025-05-07T20:33:03.8242398Z torch.manual_seed(2025) 2025-05-07T20:33:03.8242813Z 2025-05-07T20:33:03.8243250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.8243825Z 2025-05-07T20:33:03.8244139Z x_sign = torch.sign(x) 2025-05-07T20:33:03.8244615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.8256113Z x = x_sign * x_clamp 2025-05-07T20:33:03.8256531Z x0 = x[:, :D] 2025-05-07T20:33:03.8256882Z x1 = x[:, D:] 2025-05-07T20:33:03.8257227Z 2025-05-07T20:33:03.8257536Z if contiguous: 2025-05-07T20:33:03.8257909Z x0 = x0.contiguous() 2025-05-07T20:33:03.8258345Z x1 = x1.contiguous() 2025-05-07T20:33:03.8258755Z 2025-05-07T20:33:03.8259062Z if scale_ub is not None: 2025-05-07T20:33:03.8259518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.8260391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.8260908Z ) 2025-05-07T20:33:03.8261223Z else: 2025-05-07T20:33:03.8261574Z scale_ub_tensor = None 2025-05-07T20:33:03.8261990Z 2025-05-07T20:33:03.8262383Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.8262904Z op = silu_mul_quant 2025-05-07T20:33:03.8263320Z if compiled: 2025-05-07T20:33:03.8263728Z op = torch.compile(op) 2025-05-07T20:33:03.8264215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.8264678Z 2025-05-07T20:33:03.8264979Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.8265264Z 2025-05-07T20:33:03.8265428Z moe/activation_test.py:117: 2025-05-07T20:33:03.8265915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8266472Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.8266947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.8267920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.8268890Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.8270220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.8271574Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.8272557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.8273842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.8275015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.8275935Z kernel = self.compile( 2025-05-07T20:33:03.8276997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.8278120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.8278799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8279198Z 2025-05-07T20:33:03.8279551Z self = 2025-05-07T20:33:03.8281455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.8284301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8c29ee0>} 2025-05-07T20:33:03.8286713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.8288539Z context = 2025-05-07T20:33:03.8289040Z 2025-05-07T20:33:03.8289325Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.8290221Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.8291037Z module_map=module_map) 2025-05-07T20:33:03.8291646Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.8292238Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.8292665Z E ^ 2025-05-07T20:33:03.8293472Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.8294275Z 2025-05-07T20:33:03.8295021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.8295930Z 2025-05-07T20:33:03.8296110Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.8296912Z self=, 2025-05-07T20:33:03.8297616Z T=4096, 2025-05-07T20:33:03.8297931Z D=5120, 2025-05-07T20:33:03.8298239Z scale_ub=None, 2025-05-07T20:33:03.8298593Z contiguous=False, 2025-05-07T20:33:03.8298961Z compiled=True, 2025-05-07T20:33:03.8299294Z ) 2025-05-07T20:33:03.8299827Z self = 2025-05-07T20:33:03.8300679Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.8301151Z 2025-05-07T20:33:03.8301277Z @given( 2025-05-07T20:33:03.8301657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.8302234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.8302751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.8303266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.8303760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.8304144Z ) 2025-05-07T20:33:03.8304607Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.8305246Z def test_silu_mul_quant( 2025-05-07T20:33:03.8305592Z self, 2025-05-07T20:33:03.8305838Z T: int, 2025-05-07T20:33:03.8306216Z D: int, 2025-05-07T20:33:03.8306522Z scale_ub: Optional[float], 2025-05-07T20:33:03.8306972Z contiguous: bool, 2025-05-07T20:33:03.8307302Z compiled: bool, 2025-05-07T20:33:03.8307615Z ) -> None: 2025-05-07T20:33:03.8307905Z torch.manual_seed(2025) 2025-05-07T20:33:03.8308261Z 2025-05-07T20:33:03.8308649Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.8309313Z 2025-05-07T20:33:03.8309586Z x_sign = torch.sign(x) 2025-05-07T20:33:03.8310114Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.8310558Z x = x_sign * x_clamp 2025-05-07T20:33:03.8310908Z x0 = x[:, :D] 2025-05-07T20:33:03.8311228Z x1 = x[:, D:] 2025-05-07T20:33:03.8311545Z 2025-05-07T20:33:03.8311809Z if contiguous: 2025-05-07T20:33:03.8312149Z x0 = x0.contiguous() 2025-05-07T20:33:03.8312522Z x1 = x1.contiguous() 2025-05-07T20:33:03.8312874Z 2025-05-07T20:33:03.8313152Z if scale_ub is not None: 2025-05-07T20:33:03.8313553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.8314066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.8314503Z ) 2025-05-07T20:33:03.8314792Z else: 2025-05-07T20:33:03.8315111Z scale_ub_tensor = None 2025-05-07T20:33:03.8315518Z 2025-05-07T20:33:03.8315880Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.8316397Z op = silu_mul_quant 2025-05-07T20:33:03.8316787Z if compiled: 2025-05-07T20:33:03.8317176Z op = torch.compile(op) 2025-05-07T20:33:03.8317654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.8318092Z 2025-05-07T20:33:03.8318393Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.8318653Z 2025-05-07T20:33:03.8318812Z moe/activation_test.py:117: 2025-05-07T20:33:03.8319278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8319822Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.8320271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.8321189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.8322129Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.8323246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.8324375Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.8325332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.8326453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.8327542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.8328444Z kernel = self.compile( 2025-05-07T20:33:03.8329362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.8330464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.8331143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8331543Z 2025-05-07T20:33:03.8331873Z self = 2025-05-07T20:33:03.8333729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.8336134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8987940>} 2025-05-07T20:33:03.8338604Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.8340450Z context = 2025-05-07T20:33:03.8340947Z 2025-05-07T20:33:03.8341202Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.8342040Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.8342919Z module_map=module_map) 2025-05-07T20:33:03.8343507Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.8344092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.8344520Z E ^ 2025-05-07T20:33:03.8345310Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.8346097Z 2025-05-07T20:33:03.8346819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.8347722Z 2025-05-07T20:33:04.0277783Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0278604Z self=, 2025-05-07T20:33:04.0279293Z T=4096, 2025-05-07T20:33:04.0279598Z D=5120, 2025-05-07T20:33:04.0279908Z scale_ub=1200.0, 2025-05-07T20:33:04.0280292Z contiguous=False, 2025-05-07T20:33:04.0280646Z compiled=False, 2025-05-07T20:33:04.0280971Z ) 2025-05-07T20:33:04.0281445Z self = 2025-05-07T20:33:04.0282283Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.0283066Z 2025-05-07T20:33:04.0283223Z @given( 2025-05-07T20:33:04.0283589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0284117Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0284638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0285199Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0285739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0286221Z ) 2025-05-07T20:33:04.0286804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0287556Z def test_silu_mul_quant( 2025-05-07T20:33:04.0287959Z self, 2025-05-07T20:33:04.0288274Z T: int, 2025-05-07T20:33:04.0288584Z D: int, 2025-05-07T20:33:04.0288937Z scale_ub: Optional[float], 2025-05-07T20:33:04.0289386Z contiguous: bool, 2025-05-07T20:33:04.0290077Z compiled: bool, 2025-05-07T20:33:04.0290458Z ) -> None: 2025-05-07T20:33:04.0290807Z torch.manual_seed(2025) 2025-05-07T20:33:04.0291201Z 2025-05-07T20:33:04.0291645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0292229Z 2025-05-07T20:33:04.0292543Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0293034Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.0293572Z x = x_sign * x_clamp 2025-05-07T20:33:04.0293967Z x0 = x[:, :D] 2025-05-07T20:33:04.0294318Z x1 = x[:, D:] 2025-05-07T20:33:04.0294656Z 2025-05-07T20:33:04.0294958Z if contiguous: 2025-05-07T20:33:04.0295325Z x0 = x0.contiguous() 2025-05-07T20:33:04.0295756Z x1 = x1.contiguous() 2025-05-07T20:33:04.0296161Z 2025-05-07T20:33:04.0296498Z if scale_ub is not None: 2025-05-07T20:33:04.0296940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0297497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0298014Z ) 2025-05-07T20:33:04.0298315Z else: 2025-05-07T20:33:04.0298656Z scale_ub_tensor = None 2025-05-07T20:33:04.0299070Z 2025-05-07T20:33:04.0299439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0300086Z op = silu_mul_quant 2025-05-07T20:33:04.0300623Z if compiled: 2025-05-07T20:33:04.0301017Z op = torch.compile(op) 2025-05-07T20:33:04.0301493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0301947Z 2025-05-07T20:33:04.0302253Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.0302520Z 2025-05-07T20:33:04.0302679Z moe/activation_test.py:117: 2025-05-07T20:33:04.0303288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0303850Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.0304299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0305473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.0306665Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.0307580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0308742Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0310049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0310963Z kernel = self.compile( 2025-05-07T20:33:04.0311904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0313063Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0313740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0314142Z 2025-05-07T20:33:04.0314490Z self = 2025-05-07T20:33:04.0316384Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0318860Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b872f3a0>} 2025-05-07T20:33:04.0321254Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0323061Z context = 2025-05-07T20:33:04.0323560Z 2025-05-07T20:33:04.0323932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0324831Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0325636Z module_map=module_map) 2025-05-07T20:33:04.0326232Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0326814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.0327255Z E ^ 2025-05-07T20:33:04.0328055Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.0328861Z 2025-05-07T20:33:04.0329599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0330514Z 2025-05-07T20:33:04.0330682Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0331383Z self=, 2025-05-07T20:33:04.0332079Z T=4096, 2025-05-07T20:33:04.0332391Z D=5120, 2025-05-07T20:33:04.0332698Z scale_ub=1200.0, 2025-05-07T20:33:04.0333067Z contiguous=False, 2025-05-07T20:33:04.0333435Z compiled=True, 2025-05-07T20:33:04.0333762Z ) 2025-05-07T20:33:04.0334290Z self = 2025-05-07T20:33:04.0335241Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:04.0335767Z 2025-05-07T20:33:04.0335892Z @given( 2025-05-07T20:33:04.0336260Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0336790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0337294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0337845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0338465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0338949Z ) 2025-05-07T20:33:04.0339535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0340300Z def test_silu_mul_quant( 2025-05-07T20:33:04.0340701Z self, 2025-05-07T20:33:04.0341012Z T: int, 2025-05-07T20:33:04.0341333Z D: int, 2025-05-07T20:33:04.0341685Z scale_ub: Optional[float], 2025-05-07T20:33:04.0342127Z contiguous: bool, 2025-05-07T20:33:04.0342526Z compiled: bool, 2025-05-07T20:33:04.0342893Z ) -> None: 2025-05-07T20:33:04.0343237Z torch.manual_seed(2025) 2025-05-07T20:33:04.0343640Z 2025-05-07T20:33:04.0344085Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0344661Z 2025-05-07T20:33:04.0344982Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0345460Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.0345977Z x = x_sign * x_clamp 2025-05-07T20:33:04.0346372Z x0 = x[:, :D] 2025-05-07T20:33:04.0346729Z x1 = x[:, D:] 2025-05-07T20:33:04.0347072Z 2025-05-07T20:33:04.0347368Z if contiguous: 2025-05-07T20:33:04.0347750Z x0 = x0.contiguous() 2025-05-07T20:33:04.0348171Z x1 = x1.contiguous() 2025-05-07T20:33:04.0348550Z 2025-05-07T20:33:04.0348809Z if scale_ub is not None: 2025-05-07T20:33:04.0349170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0349604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0350218Z ) 2025-05-07T20:33:04.0350499Z else: 2025-05-07T20:33:04.0350773Z scale_ub_tensor = None 2025-05-07T20:33:04.0351126Z 2025-05-07T20:33:04.0351437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0351877Z op = silu_mul_quant 2025-05-07T20:33:04.0352240Z if compiled: 2025-05-07T20:33:04.0352585Z op = torch.compile(op) 2025-05-07T20:33:04.0352982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0353367Z 2025-05-07T20:33:04.0353713Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.0353956Z 2025-05-07T20:33:04.0354101Z moe/activation_test.py:117: 2025-05-07T20:33:04.0354546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0355042Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.0355445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0356290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.0357154Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.0358183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.0359358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.0360243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0361394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0362507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0363337Z kernel = self.compile( 2025-05-07T20:33:04.0364222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0365390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0366095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0366480Z 2025-05-07T20:33:04.0366816Z self = 2025-05-07T20:33:04.0368662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0371188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b872f280>} 2025-05-07T20:33:04.0373524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0375295Z context = 2025-05-07T20:33:04.0375808Z 2025-05-07T20:33:04.0376084Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0376978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0377773Z module_map=module_map) 2025-05-07T20:33:04.0378363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0378945Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.0379381Z E ^ 2025-05-07T20:33:04.0380176Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.0380950Z 2025-05-07T20:33:04.0381687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0382641Z 2025-05-07T20:33:04.3116759Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.3117982Z self=, 2025-05-07T20:33:04.3118742Z T=2048, 2025-05-07T20:33:04.3119054Z D=7168, 2025-05-07T20:33:04.3119358Z scale_ub=1200.0, 2025-05-07T20:33:04.3119714Z contiguous=False, 2025-05-07T20:33:04.3120075Z compiled=False, 2025-05-07T20:33:04.3120408Z ) 2025-05-07T20:33:04.3120937Z self = 2025-05-07T20:33:04.3121796Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.3122244Z 2025-05-07T20:33:04.3122669Z @given( 2025-05-07T20:33:04.3123002Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.3123446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.3123895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.3124394Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.3124894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.3125349Z ) 2025-05-07T20:33:04.3125906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.3126597Z def test_silu_mul_quant( 2025-05-07T20:33:04.3126954Z self, 2025-05-07T20:33:04.3127242Z T: int, 2025-05-07T20:33:04.3127526Z D: int, 2025-05-07T20:33:04.3127850Z scale_ub: Optional[float], 2025-05-07T20:33:04.3128292Z contiguous: bool, 2025-05-07T20:33:04.3128663Z compiled: bool, 2025-05-07T20:33:04.3129012Z ) -> None: 2025-05-07T20:33:04.3129359Z torch.manual_seed(2025) 2025-05-07T20:33:04.3129737Z 2025-05-07T20:33:04.3130159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.3130747Z 2025-05-07T20:33:04.3131066Z x_sign = torch.sign(x) 2025-05-07T20:33:04.3131526Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.3132231Z x = x_sign * x_clamp 2025-05-07T20:33:04.3132641Z x0 = x[:, :D] 2025-05-07T20:33:04.3133108Z x1 = x[:, D:] 2025-05-07T20:33:04.3133440Z 2025-05-07T20:33:04.3133738Z if contiguous: 2025-05-07T20:33:04.3134096Z x0 = x0.contiguous() 2025-05-07T20:33:04.3134516Z x1 = x1.contiguous() 2025-05-07T20:33:04.3134912Z 2025-05-07T20:33:04.3135216Z if scale_ub is not None: 2025-05-07T20:33:04.3135795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.3136350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.3136864Z ) 2025-05-07T20:33:04.3137168Z else: 2025-05-07T20:33:04.3137521Z scale_ub_tensor = None 2025-05-07T20:33:04.3137936Z 2025-05-07T20:33:04.3138303Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.3138825Z op = silu_mul_quant 2025-05-07T20:33:04.3139232Z if compiled: 2025-05-07T20:33:04.3139628Z op = torch.compile(op) 2025-05-07T20:33:04.3140116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.3140577Z 2025-05-07T20:33:04.3140878Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.3141161Z 2025-05-07T20:33:04.3141325Z moe/activation_test.py:117: 2025-05-07T20:33:04.3141811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.3142360Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.3142835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.3144038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.3145237Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.3146153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.3147343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.3148503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.3149446Z kernel = self.compile( 2025-05-07T20:33:04.3150565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.3151735Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.3152473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.3152864Z 2025-05-07T20:33:04.3153207Z self = 2025-05-07T20:33:04.3155186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.3157573Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b856d670>} 2025-05-07T20:33:04.3159951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.3171521Z context = 2025-05-07T20:33:04.3172071Z 2025-05-07T20:33:04.3172354Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.3173261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.3174062Z module_map=module_map) 2025-05-07T20:33:04.3174666Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.3175242Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.3175676Z E ^ 2025-05-07T20:33:04.3176616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.3177482Z 2025-05-07T20:33:04.3178212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.3179126Z 2025-05-07T20:33:04.3179294Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.3179978Z self=, 2025-05-07T20:33:04.3180726Z T=1, 2025-05-07T20:33:04.3181019Z D=7168, 2025-05-07T20:33:04.3181333Z scale_ub=None, 2025-05-07T20:33:04.3181681Z contiguous=True, 2025-05-07T20:33:04.3182039Z compiled=False, 2025-05-07T20:33:04.3182374Z ) 2025-05-07T20:33:04.3183236Z self = 2025-05-07T20:33:04.3184057Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.3184508Z 2025-05-07T20:33:04.3184636Z @given( 2025-05-07T20:33:04.3185006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.3185528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.3186027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.3186571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.3187101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.3187560Z ) 2025-05-07T20:33:04.3188156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.3188916Z def test_silu_mul_quant( 2025-05-07T20:33:04.3189303Z self, 2025-05-07T20:33:04.3189621Z T: int, 2025-05-07T20:33:04.3190061Z D: int, 2025-05-07T20:33:04.3190401Z scale_ub: Optional[float], 2025-05-07T20:33:04.3190853Z contiguous: bool, 2025-05-07T20:33:04.3191246Z compiled: bool, 2025-05-07T20:33:04.3191592Z ) -> None: 2025-05-07T20:33:04.3191938Z torch.manual_seed(2025) 2025-05-07T20:33:04.3192330Z 2025-05-07T20:33:04.3192768Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.3193350Z 2025-05-07T20:33:04.3193648Z x_sign = torch.sign(x) 2025-05-07T20:33:04.3194118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.3194618Z x = x_sign * x_clamp 2025-05-07T20:33:04.3194999Z x0 = x[:, :D] 2025-05-07T20:33:04.3195337Z x1 = x[:, D:] 2025-05-07T20:33:04.3195662Z 2025-05-07T20:33:04.3195944Z if contiguous: 2025-05-07T20:33:04.3196315Z x0 = x0.contiguous() 2025-05-07T20:33:04.3196732Z x1 = x1.contiguous() 2025-05-07T20:33:04.3197250Z 2025-05-07T20:33:04.3197567Z if scale_ub is not None: 2025-05-07T20:33:04.3198014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.3198567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.3199068Z ) 2025-05-07T20:33:04.3199374Z else: 2025-05-07T20:33:04.3199707Z scale_ub_tensor = None 2025-05-07T20:33:04.3200091Z 2025-05-07T20:33:04.3200453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.3200976Z op = silu_mul_quant 2025-05-07T20:33:04.3201381Z if compiled: 2025-05-07T20:33:04.3201779Z op = torch.compile(op) 2025-05-07T20:33:04.3202260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.3202708Z 2025-05-07T20:33:04.3203015Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.3203290Z 2025-05-07T20:33:04.3203459Z moe/activation_test.py:117: 2025-05-07T20:33:04.3203943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.3204501Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.3204965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.3206270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.3207485Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:04.3231319Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)

self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
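The test body above is self-describing: silu_mul_quant fuses a SiLU gate, an elementwise multiply, and quantization to FP8, returning the quantized tensor and its scale. As a reading aid, a minimal eager-mode sketch of those semantics follows; the rowwise-amax scaling, the clamp epsilon, and the torch.float8_e4m3fn target are illustrative assumptions, not the kernel's confirmed scheme.

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Sketch: SiLU(x0) * x1 in fp32, then rowwise FP8 (e4m3) quantization.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            # Optional cap on the rowwise max, mirroring scale_ub_tensor in the test.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / torch.finfo(torch.float8_e4m3fn).max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Against that reference, the @given grid (five T values, two D values, scale_ub on/off, contiguity on/off, torch.compile on/off) exercises layout and compilation paths rather than numerics, which is why every combination below dies at the same spot: the kernel's cast to fp8e4nv fails before any arithmetic runs.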
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.5191569Z 2025-05-07T20:33:04.5192286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.5193210Z 2025-05-07T20:33:04.5193377Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.5194068Z self=, 2025-05-07T20:33:04.5194738Z T=1, 2025-05-07T20:33:04.5195049Z D=7168, 2025-05-07T20:33:04.5195356Z scale_ub=None, 2025-05-07T20:33:04.5195703Z contiguous=False, 2025-05-07T20:33:04.5196078Z compiled=False, 2025-05-07T20:33:04.5196402Z ) 2025-05-07T20:33:04.5196923Z self = 2025-05-07T20:33:04.5197753Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.5198181Z 2025-05-07T20:33:04.5198308Z @given( 2025-05-07T20:33:04.5198667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.5199185Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.5199687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.5200225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.5200774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.5201252Z ) 2025-05-07T20:33:04.5201829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.5202580Z def test_silu_mul_quant( 2025-05-07T20:33:04.5202976Z self, 2025-05-07T20:33:04.5203275Z T: int, 2025-05-07T20:33:04.5203586Z D: int, 2025-05-07T20:33:04.5203933Z scale_ub: Optional[float], 2025-05-07T20:33:04.5204519Z contiguous: bool, 2025-05-07T20:33:04.5204899Z compiled: bool, 2025-05-07T20:33:04.5205255Z ) -> None: 2025-05-07T20:33:04.5205593Z torch.manual_seed(2025) 2025-05-07T20:33:04.5205976Z 2025-05-07T20:33:04.5206415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.5206996Z 2025-05-07T20:33:04.5207290Z x_sign = torch.sign(x) 2025-05-07T20:33:04.5207763Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.5208279Z x = x_sign * x_clamp 2025-05-07T20:33:04.5208656Z x0 = x[:, :D] 2025-05-07T20:33:04.5209001Z x1 = x[:, D:] 2025-05-07T20:33:04.5209333Z 2025-05-07T20:33:04.5209620Z if contiguous: 2025-05-07T20:33:04.5209992Z x0 = x0.contiguous() 2025-05-07T20:33:04.5210417Z x1 = x1.contiguous() 2025-05-07T20:33:04.5210802Z 2025-05-07T20:33:04.5211096Z if scale_ub is not None: 2025-05-07T20:33:04.5211540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.5212075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.5212585Z ) 2025-05-07T20:33:04.5212891Z else: 2025-05-07T20:33:04.5213222Z scale_ub_tensor = None 2025-05-07T20:33:04.5213626Z 2025-05-07T20:33:04.5214070Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.5214600Z op = silu_mul_quant 2025-05-07T20:33:04.5215053Z if compiled: 2025-05-07T20:33:04.5215448Z op = torch.compile(op) 2025-05-07T20:33:04.5215926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.5216368Z 2025-05-07T20:33:04.5216673Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.5216944Z 2025-05-07T20:33:04.5217179Z moe/activation_test.py:117: 2025-05-07T20:33:04.5217655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.5218215Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.5218669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.5219848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.5221049Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.5221963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.5223154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.5224245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.5225166Z kernel = self.compile( 2025-05-07T20:33:04.5226104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.5227264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.5227937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.5228351Z 2025-05-07T20:33:04.5228693Z self = 2025-05-07T20:33:04.5230694Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.5233177Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b87bb670>} 2025-05-07T20:33:04.5235525Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.5237309Z context = 2025-05-07T20:33:04.5237812Z 2025-05-07T20:33:04.5238171Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.5239087Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.5239822Z module_map=module_map) 2025-05-07T20:33:04.5240356Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.5240910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.5241334Z E ^ 2025-05-07T20:33:04.5242118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.5242925Z 2025-05-07T20:33:04.5243651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.5244560Z 2025-05-07T20:33:04.5244734Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.5245426Z self=, 2025-05-07T20:33:04.5246106Z T=2048, 2025-05-07T20:33:04.5246410Z D=7168, 2025-05-07T20:33:04.5246717Z scale_ub=None, 2025-05-07T20:33:04.5247061Z contiguous=False, 2025-05-07T20:33:04.5247423Z compiled=True, 2025-05-07T20:33:04.5247751Z ) 2025-05-07T20:33:04.8154529Z self = 2025-05-07T20:33:04.8155488Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:04.8156100Z 2025-05-07T20:33:04.8156230Z @given( 2025-05-07T20:33:04.8156600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8157092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8157558Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8158161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8158692Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8159158Z ) 2025-05-07T20:33:04.8159756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8160522Z def test_silu_mul_quant( 2025-05-07T20:33:04.8160927Z self, 2025-05-07T20:33:04.8161232Z T: int, 2025-05-07T20:33:04.8161551Z D: int, 2025-05-07T20:33:04.8161906Z scale_ub: Optional[float], 2025-05-07T20:33:04.8162347Z contiguous: bool, 2025-05-07T20:33:04.8162745Z compiled: bool, 2025-05-07T20:33:04.8163119Z ) -> None: 2025-05-07T20:33:04.8163459Z torch.manual_seed(2025) 2025-05-07T20:33:04.8163856Z 2025-05-07T20:33:04.8164295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8164870Z 2025-05-07T20:33:04.8165181Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8165658Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8166175Z x = x_sign * x_clamp 2025-05-07T20:33:04.8166569Z x0 = x[:, :D] 2025-05-07T20:33:04.8166918Z x1 = x[:, D:] 2025-05-07T20:33:04.8167254Z 2025-05-07T20:33:04.8167551Z if contiguous: 2025-05-07T20:33:04.8167927Z x0 = x0.contiguous() 2025-05-07T20:33:04.8168352Z x1 = x1.contiguous() 2025-05-07T20:33:04.8168743Z 2025-05-07T20:33:04.8169054Z if scale_ub is not None: 2025-05-07T20:33:04.8169506Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8170062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8170584Z ) 2025-05-07T20:33:04.8170903Z else: 2025-05-07T20:33:04.8171243Z scale_ub_tensor = None 2025-05-07T20:33:04.8171667Z 2025-05-07T20:33:04.8172043Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8172566Z op = silu_mul_quant 2025-05-07T20:33:04.8172981Z if compiled: 2025-05-07T20:33:04.8173386Z op = torch.compile(op) 2025-05-07T20:33:04.8173865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8174323Z 2025-05-07T20:33:04.8174762Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8175044Z 2025-05-07T20:33:04.8175222Z moe/activation_test.py:117: 2025-05-07T20:33:04.8175699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8176255Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8176721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8177670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.8178641Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.8179791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8180988Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8181886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8183405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8184546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8185449Z kernel = self.compile( 2025-05-07T20:33:04.8186471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8187594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8188340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8188735Z 2025-05-07T20:33:04.8189074Z self = 2025-05-07T20:33:04.8191124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8193702Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b85ef550>} 2025-05-07T20:33:04.8196087Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8197889Z context = 2025-05-07T20:33:04.8198398Z 2025-05-07T20:33:04.8198671Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8199564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8200382Z module_map=module_map) 2025-05-07T20:33:04.8200977Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8201566Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8202005Z E ^ 2025-05-07T20:33:04.8202792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8203602Z 2025-05-07T20:33:04.8204329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8205247Z 2025-05-07T20:33:04.8205413Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8206109Z self=, 2025-05-07T20:33:04.8206788Z T=4096, 2025-05-07T20:33:04.8207089Z D=7168, 2025-05-07T20:33:04.8207404Z scale_ub=None, 2025-05-07T20:33:04.8207743Z contiguous=False, 2025-05-07T20:33:04.8208107Z compiled=True, 2025-05-07T20:33:04.8208442Z ) 2025-05-07T20:33:04.8208964Z self = 2025-05-07T20:33:04.8209805Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:04.8210443Z 2025-05-07T20:33:04.8210576Z @given( 2025-05-07T20:33:04.8210947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8211456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8211962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8212522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8213064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8213552Z ) 2025-05-07T20:33:04.8214139Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8214889Z def test_silu_mul_quant( 2025-05-07T20:33:04.8215284Z self, 2025-05-07T20:33:04.8215598Z T: int, 2025-05-07T20:33:04.8215914Z D: int, 2025-05-07T20:33:04.8216260Z scale_ub: Optional[float], 2025-05-07T20:33:04.8216701Z contiguous: bool, 2025-05-07T20:33:04.8217094Z compiled: bool, 2025-05-07T20:33:04.8217447Z ) -> None: 2025-05-07T20:33:04.8217799Z torch.manual_seed(2025) 2025-05-07T20:33:04.8218198Z 2025-05-07T20:33:04.8218652Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8219236Z 2025-05-07T20:33:04.8219545Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8220087Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8220615Z x = x_sign * x_clamp 2025-05-07T20:33:04.8221065Z x0 = x[:, :D] 2025-05-07T20:33:04.8230620Z x1 = x[:, D:] 2025-05-07T20:33:04.8230950Z 2025-05-07T20:33:04.8231227Z if contiguous: 2025-05-07T20:33:04.8231563Z x0 = x0.contiguous() 2025-05-07T20:33:04.8231955Z x1 = x1.contiguous() 2025-05-07T20:33:04.8232316Z 2025-05-07T20:33:04.8232754Z if scale_ub is not None: 2025-05-07T20:33:04.8233159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8233655Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8234096Z ) 2025-05-07T20:33:04.8234385Z else: 2025-05-07T20:33:04.8234693Z scale_ub_tensor = None 2025-05-07T20:33:04.8235067Z 2025-05-07T20:33:04.8235405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8235876Z op = silu_mul_quant 2025-05-07T20:33:04.8236237Z if compiled: 2025-05-07T20:33:04.8236613Z op = torch.compile(op) 2025-05-07T20:33:04.8237071Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8237463Z 2025-05-07T20:33:04.8237764Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8238009Z 2025-05-07T20:33:04.8238147Z moe/activation_test.py:117: 2025-05-07T20:33:04.8238579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8239116Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8239566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8240503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.8241432Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.8242544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8243719Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8244616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8245758Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8246873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8247763Z kernel = self.compile( 2025-05-07T20:33:04.8248667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8249762Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8250508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8250889Z 2025-05-07T20:33:04.8251220Z self = 2025-05-07T20:33:04.8253070Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8255465Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8367160>} 2025-05-07T20:33:04.8257779Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8259533Z context = 2025-05-07T20:33:04.8260010Z 2025-05-07T20:33:04.8260271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8261137Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8261963Z module_map=module_map) 2025-05-07T20:33:04.8262537Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8263148Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8263561Z E ^ 2025-05-07T20:33:04.8264324Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8265089Z 2025-05-07T20:33:04.8265786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8266714Z 2025-05-07T20:33:05.0312302Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.0313122Z self=, 2025-05-07T20:33:05.0313822Z T=16384, 2025-05-07T20:33:05.0314137Z D=5120, 2025-05-07T20:33:05.0314436Z scale_ub=1200.0, 2025-05-07T20:33:05.0314801Z contiguous=False, 2025-05-07T20:33:05.0315156Z compiled=False, 2025-05-07T20:33:05.0315469Z ) 2025-05-07T20:33:05.0315955Z self = 2025-05-07T20:33:05.0316792Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.0317274Z 2025-05-07T20:33:05.0317406Z @given( 2025-05-07T20:33:05.0317768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.0318290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.0318810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.0319350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.0319898Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.0320377Z ) 2025-05-07T20:33:05.0320950Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.0321709Z def test_silu_mul_quant( 2025-05-07T20:33:05.0322103Z self, 2025-05-07T20:33:05.0322410Z T: int, 2025-05-07T20:33:05.0322769Z D: int, 2025-05-07T20:33:05.0323117Z scale_ub: Optional[float], 2025-05-07T20:33:05.0323559Z contiguous: bool, 2025-05-07T20:33:05.0323954Z compiled: bool, 2025-05-07T20:33:05.0324313Z ) -> None: 2025-05-07T20:33:05.0324657Z torch.manual_seed(2025) 2025-05-07T20:33:05.0325051Z 2025-05-07T20:33:05.0325489Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.0326051Z 2025-05-07T20:33:05.0326364Z x_sign = torch.sign(x) 2025-05-07T20:33:05.0326835Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.0327339Z x = x_sign * x_clamp 2025-05-07T20:33:05.0327731Z x0 = x[:, :D] 2025-05-07T20:33:05.0328364Z x1 = x[:, D:] 2025-05-07T20:33:05.0328699Z 2025-05-07T20:33:05.0328996Z if contiguous: 2025-05-07T20:33:05.0329369Z x0 = x0.contiguous() 2025-05-07T20:33:05.0329780Z x1 = x1.contiguous() 2025-05-07T20:33:05.0330171Z 2025-05-07T20:33:05.0330475Z if scale_ub is not None: 2025-05-07T20:33:05.0330912Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.0331468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.0331983Z ) 2025-05-07T20:33:05.0332281Z else: 2025-05-07T20:33:05.0332615Z scale_ub_tensor = None 2025-05-07T20:33:05.0333025Z 2025-05-07T20:33:05.0333385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.0333925Z op = silu_mul_quant 2025-05-07T20:33:05.0334336Z if compiled: 2025-05-07T20:33:05.0334731Z op = torch.compile(op) 2025-05-07T20:33:05.0335214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0335667Z 2025-05-07T20:33:05.0335969Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.0336237Z 2025-05-07T20:33:05.0336392Z moe/activation_test.py:117: 2025-05-07T20:33:05.0336877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0337568Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.0338024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0339319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:05.0340495Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.0341387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.0342719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.0343849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.0344739Z kernel = self.compile( 2025-05-07T20:33:05.0345642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.0346771Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.0347449Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0347846Z 2025-05-07T20:33:05.0348192Z self = 2025-05-07T20:33:05.0350232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.0352717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8367940>} 2025-05-07T20:33:05.0355094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.0356894Z context = 2025-05-07T20:33:05.0357381Z 2025-05-07T20:33:05.0357673Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.0358565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.0359374Z module_map=module_map) 2025-05-07T20:33:05.0359975Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0360549Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0360984Z E ^ 2025-05-07T20:33:05.0361869Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.0362677Z 2025-05-07T20:33:05.0363416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.0364322Z 2025-05-07T20:33:05.0364488Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.0365185Z self=, 2025-05-07T20:33:05.0365872Z T=16384, 2025-05-07T20:33:05.0366173Z D=5120, 2025-05-07T20:33:05.0366489Z scale_ub=1200.0, 2025-05-07T20:33:05.0366847Z contiguous=True, 2025-05-07T20:33:05.0367198Z compiled=True, 2025-05-07T20:33:05.0367530Z ) 2025-05-07T20:33:05.0368060Z self = 2025-05-07T20:33:05.0368915Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.0369391Z 2025-05-07T20:33:05.0369516Z @given( 2025-05-07T20:33:05.0369889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.0370408Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.0370906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.0371456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.0372014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.0372600Z ) 2025-05-07T20:33:05.0373195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.0374006Z def test_silu_mul_quant( 2025-05-07T20:33:05.0374398Z self, 2025-05-07T20:33:05.0374705Z T: int, 2025-05-07T20:33:05.0375020Z D: int, 2025-05-07T20:33:05.0375371Z scale_ub: Optional[float], 2025-05-07T20:33:05.0375810Z contiguous: bool, 2025-05-07T20:33:05.0376263Z compiled: bool, 2025-05-07T20:33:05.0376627Z ) -> None: 2025-05-07T20:33:05.0376967Z torch.manual_seed(2025) 2025-05-07T20:33:05.0377364Z 2025-05-07T20:33:05.0377808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.0378381Z 2025-05-07T20:33:05.0378688Z x_sign = torch.sign(x) 2025-05-07T20:33:05.0379162Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.0379666Z x = x_sign * x_clamp 2025-05-07T20:33:05.0380053Z x0 = x[:, :D] 2025-05-07T20:33:05.0380406Z x1 = x[:, D:] 2025-05-07T20:33:05.0380731Z 2025-05-07T20:33:05.0381032Z if contiguous: 2025-05-07T20:33:05.0381405Z x0 = x0.contiguous() 2025-05-07T20:33:05.0381820Z x1 = x1.contiguous() 2025-05-07T20:33:05.0382217Z 2025-05-07T20:33:05.0382526Z if scale_ub is not None: 2025-05-07T20:33:05.0383186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.0383710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.0384118Z ) 2025-05-07T20:33:05.0384374Z else: 2025-05-07T20:33:05.0384646Z scale_ub_tensor = None 2025-05-07T20:33:05.0385008Z 2025-05-07T20:33:05.0385311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.0385753Z op = silu_mul_quant 2025-05-07T20:33:05.0386094Z if compiled: 2025-05-07T20:33:05.0386425Z op = torch.compile(op) 2025-05-07T20:33:05.0386830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0387211Z 2025-05-07T20:33:05.0387465Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.0387691Z 2025-05-07T20:33:05.0387824Z moe/activation_test.py:117: 2025-05-07T20:33:05.0388238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0388737Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.0389153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.0390146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.0391010Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.0392142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.0393213Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.0394011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.0395035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.0396098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.0396973Z kernel = self.compile( 2025-05-07T20:33:05.0397861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.0398949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.0399625Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.0400009Z 2025-05-07T20:33:05.0400359Z self = 2025-05-07T20:33:05.0402427Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.0404811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b81d7550>} 2025-05-07T20:33:05.0407280Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.0409201Z context = 2025-05-07T20:33:05.0409701Z 2025-05-07T20:33:05.0409990Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.0410880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.0411684Z module_map=module_map) 2025-05-07T20:33:05.0412289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0412881Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0413301Z E ^ 2025-05-07T20:33:05.0414103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.0414902Z 2025-05-07T20:33:05.0415639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.0416546Z 2025-05-07T20:33:05.2649509Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2650290Z self=, 2025-05-07T20:33:05.2650959Z T=16384, 2025-05-07T20:33:05.2651306Z D=5120, 2025-05-07T20:33:05.2651617Z scale_ub=None, 2025-05-07T20:33:05.2651968Z contiguous=False, 2025-05-07T20:33:05.2652368Z compiled=True, 2025-05-07T20:33:05.2652702Z ) 2025-05-07T20:33:05.2653204Z self = 2025-05-07T20:33:05.2654078Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.2654502Z 2025-05-07T20:33:05.2654619Z @given( 2025-05-07T20:33:05.2654939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2655384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2655825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2656318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2656825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2657271Z ) 2025-05-07T20:33:05.2657827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2658814Z def test_silu_mul_quant( 2025-05-07T20:33:05.2659194Z self, 2025-05-07T20:33:05.2659486Z T: int, 2025-05-07T20:33:05.2659771Z D: int, 2025-05-07T20:33:05.2660111Z scale_ub: Optional[float], 2025-05-07T20:33:05.2660555Z contiguous: bool, 2025-05-07T20:33:05.2660927Z compiled: bool, 2025-05-07T20:33:05.2661284Z ) -> None: 2025-05-07T20:33:05.2661628Z torch.manual_seed(2025) 2025-05-07T20:33:05.2662034Z 2025-05-07T20:33:05.2662487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2663042Z 2025-05-07T20:33:05.2663342Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2663809Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2664345Z x = x_sign * x_clamp 2025-05-07T20:33:05.2664741Z x0 = x[:, :D] 2025-05-07T20:33:05.2665085Z x1 = x[:, D:] 2025-05-07T20:33:05.2665417Z 2025-05-07T20:33:05.2665716Z if contiguous: 2025-05-07T20:33:05.2666092Z x0 = x0.contiguous() 2025-05-07T20:33:05.2666521Z x1 = x1.contiguous() 2025-05-07T20:33:05.2666917Z 2025-05-07T20:33:05.2667223Z if scale_ub is not None: 2025-05-07T20:33:05.2667675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2668375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2668892Z ) 2025-05-07T20:33:05.2669303Z else: 2025-05-07T20:33:05.2669641Z scale_ub_tensor = None 2025-05-07T20:33:05.2670225Z 2025-05-07T20:33:05.2670601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2671149Z op = silu_mul_quant 2025-05-07T20:33:05.2671573Z if compiled: 2025-05-07T20:33:05.2671979Z op = torch.compile(op) 2025-05-07T20:33:05.2672647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2673117Z 2025-05-07T20:33:05.2673419Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2673697Z 2025-05-07T20:33:05.2673862Z moe/activation_test.py:117: 2025-05-07T20:33:05.2674353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2674904Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2675374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2676340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.2677324Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.2678454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2679664Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2680613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2681803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2683308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2684237Z kernel = self.compile( 2025-05-07T20:33:05.2685169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2686294Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2686953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2687341Z 2025-05-07T20:33:05.2687670Z self = 2025-05-07T20:33:05.2689504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2692068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b83471f0>} 2025-05-07T20:33:05.2694420Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2696228Z context = 2025-05-07T20:33:05.2696726Z 2025-05-07T20:33:05.2697013Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2697897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2698678Z module_map=module_map) 2025-05-07T20:33:05.2699273Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2699853Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2700272Z E ^ 2025-05-07T20:33:05.2701068Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2701875Z 2025-05-07T20:33:05.2702669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2703577Z 2025-05-07T20:33:05.2703893Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2704590Z self=, 2025-05-07T20:33:05.2705335Z T=2048, 2025-05-07T20:33:05.2705635Z D=5120, 2025-05-07T20:33:05.2705929Z scale_ub=None, 2025-05-07T20:33:05.2706276Z contiguous=False, 2025-05-07T20:33:05.2706636Z compiled=True, 2025-05-07T20:33:05.2706954Z ) 2025-05-07T20:33:05.3911519Z self = 2025-05-07T20:33:05.3912796Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3913292Z 2025-05-07T20:33:05.3913419Z @given( 2025-05-07T20:33:05.3913796Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3914289Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3914760Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3915265Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3915815Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3916284Z ) 2025-05-07T20:33:05.3916885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3917646Z def test_silu_mul_quant( 2025-05-07T20:33:05.3918038Z self, 2025-05-07T20:33:05.3918354Z T: int, 2025-05-07T20:33:05.3918671Z D: int, 2025-05-07T20:33:05.3919014Z scale_ub: Optional[float], 2025-05-07T20:33:05.3919466Z contiguous: bool, 2025-05-07T20:33:05.3919861Z compiled: bool, 2025-05-07T20:33:05.3920218Z ) -> None: 2025-05-07T20:33:05.3920571Z torch.manual_seed(2025) 2025-05-07T20:33:05.3920963Z 2025-05-07T20:33:05.3921399Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3921976Z 2025-05-07T20:33:05.3922296Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3922762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3923280Z x = x_sign * x_clamp 2025-05-07T20:33:05.3923676Z x0 = x[:, :D] 2025-05-07T20:33:05.3924029Z x1 = x[:, D:] 2025-05-07T20:33:05.3924359Z 2025-05-07T20:33:05.3924656Z if contiguous: 2025-05-07T20:33:05.3925028Z x0 = x0.contiguous() 2025-05-07T20:33:05.3925450Z x1 = x1.contiguous() 2025-05-07T20:33:05.3925849Z 2025-05-07T20:33:05.3926154Z if scale_ub is not None: 2025-05-07T20:33:05.3926589Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3927145Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3927657Z ) 2025-05-07T20:33:05.3927956Z else: 2025-05-07T20:33:05.3928430Z scale_ub_tensor = None 2025-05-07T20:33:05.3928850Z 2025-05-07T20:33:05.3929212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3929739Z op = silu_mul_quant 2025-05-07T20:33:05.3930150Z if compiled: 2025-05-07T20:33:05.3930542Z op = torch.compile(op) 2025-05-07T20:33:05.3931033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3931494Z 2025-05-07T20:33:05.3931802Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3932077Z 2025-05-07T20:33:05.3932239Z moe/activation_test.py:117: 2025-05-07T20:33:05.3932730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3933293Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3933751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3934710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3935681Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3936804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3938002Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3939015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3940177Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3941392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3942298Z kernel = self.compile( 2025-05-07T20:33:05.3943195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3944389Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3945053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3945459Z 2025-05-07T20:33:05.3945797Z self = 2025-05-07T20:33:05.3947705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3950367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8347f70>} 2025-05-07T20:33:05.3952742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3964816Z context = 2025-05-07T20:33:05.3965341Z 2025-05-07T20:33:05.3965642Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3966539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3967350Z module_map=module_map) 2025-05-07T20:33:05.3967970Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3968549Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3968988Z E ^ 2025-05-07T20:33:05.3969792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3970593Z 2025-05-07T20:33:05.3971337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3972256Z 2025-05-07T20:33:05.3972424Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3973125Z self=, 2025-05-07T20:33:05.3973908Z T=2048, 2025-05-07T20:33:05.3974213Z D=5120, 2025-05-07T20:33:05.3974525Z scale_ub=1200.0, 2025-05-07T20:33:05.3974887Z contiguous=False, 2025-05-07T20:33:05.3975250Z compiled=True, 2025-05-07T20:33:05.3975584Z ) 2025-05-07T20:33:05.3976119Z self = 2025-05-07T20:33:05.3976955Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3977439Z 2025-05-07T20:33:05.3977564Z @given( 2025-05-07T20:33:05.3977936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3978458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3978959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3979510Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3980065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3980512Z ) 2025-05-07T20:33:05.3981037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3981669Z def test_silu_mul_quant( 2025-05-07T20:33:05.3981999Z self, 2025-05-07T20:33:05.3982265Z T: int, 2025-05-07T20:33:05.3982542Z D: int, 2025-05-07T20:33:05.3983198Z scale_ub: Optional[float], 2025-05-07T20:33:05.3983581Z contiguous: bool, 2025-05-07T20:33:05.3984063Z compiled: bool, 2025-05-07T20:33:05.3984448Z ) -> None: 2025-05-07T20:33:05.3984727Z torch.manual_seed(2025) 2025-05-07T20:33:05.3985052Z 2025-05-07T20:33:05.3985440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3985982Z 2025-05-07T20:33:05.3986259Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3986685Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3987298Z x = x_sign * x_clamp 2025-05-07T20:33:05.3987642Z x0 = x[:, :D] 2025-05-07T20:33:05.3987956Z x1 = x[:, D:] 2025-05-07T20:33:05.3988259Z 2025-05-07T20:33:05.3988528Z if contiguous: 2025-05-07T20:33:05.3988851Z x0 = x0.contiguous() 2025-05-07T20:33:05.3989213Z x1 = x1.contiguous() 2025-05-07T20:33:05.3989543Z 2025-05-07T20:33:05.3989924Z if scale_ub is not None: 2025-05-07T20:33:05.3990337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3990830Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3991290Z ) 2025-05-07T20:33:05.3991565Z else: 2025-05-07T20:33:05.3991861Z scale_ub_tensor = None 2025-05-07T20:33:05.3992236Z 2025-05-07T20:33:05.3992576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3993061Z op = silu_mul_quant 2025-05-07T20:33:05.3993462Z if compiled: 2025-05-07T20:33:05.3993853Z op = torch.compile(op) 2025-05-07T20:33:05.3994275Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3994711Z 2025-05-07T20:33:05.3995010Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3995270Z 2025-05-07T20:33:05.3995432Z moe/activation_test.py:117: 2025-05-07T20:33:05.3995892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3996430Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3996877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3997790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3998729Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3999838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.4001013Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.4001895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.4003177Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.4004292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.4005182Z kernel = self.compile( 2025-05-07T20:33:05.4006073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.4007124Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.4007766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.4008142Z 2025-05-07T20:33:05.4008471Z self = 2025-05-07T20:33:05.4010278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.4012726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8195940>} 2025-05-07T20:33:05.4015114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.4016831Z context = 2025-05-07T20:33:05.4017401Z 2025-05-07T20:33:05.4017672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.4018573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.4019386Z module_map=module_map) 2025-05-07T20:33:05.4020052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.4020642Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.4021072Z E ^ 2025-05-07T20:33:05.4021878Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.4022681Z 2025-05-07T20:33:05.4023412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.4024333Z 2025-05-07T20:33:05.8001111Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.8001903Z self=, 2025-05-07T20:33:05.8002645Z T=4096, 2025-05-07T20:33:05.8002944Z D=5120, 2025-05-07T20:33:05.8003250Z scale_ub=1200.0, 2025-05-07T20:33:05.8003618Z contiguous=True, 2025-05-07T20:33:05.8003960Z compiled=True, 2025-05-07T20:33:05.8004290Z ) 2025-05-07T20:33:05.8004819Z self = 2025-05-07T20:33:05.8005645Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.8006073Z 2025-05-07T20:33:05.8006189Z @given( 2025-05-07T20:33:05.8006503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.8006944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.8007372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.8007855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.8008346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.8008779Z ) 2025-05-07T20:33:05.8009332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.8010043Z def test_silu_mul_quant( 2025-05-07T20:33:05.8010383Z self, 2025-05-07T20:33:05.8010664Z T: int, 2025-05-07T20:33:05.8010957Z D: int, 2025-05-07T20:33:05.8011270Z scale_ub: Optional[float], 2025-05-07T20:33:05.8011681Z contiguous: bool, 2025-05-07T20:33:05.8012064Z compiled: bool, 2025-05-07T20:33:05.8012403Z ) -> None: 2025-05-07T20:33:05.8013065Z torch.manual_seed(2025) 2025-05-07T20:33:05.8013471Z 2025-05-07T20:33:05.8013931Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.8014498Z 2025-05-07T20:33:05.8014790Z x_sign = torch.sign(x) 2025-05-07T20:33:05.8015250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.8015764Z x = x_sign * x_clamp 2025-05-07T20:33:05.8016157Z x0 = x[:, :D] 2025-05-07T20:33:05.8016508Z x1 = x[:, D:] 2025-05-07T20:33:05.8016833Z 2025-05-07T20:33:05.8017131Z if contiguous: 2025-05-07T20:33:05.8017503Z x0 = x0.contiguous() 2025-05-07T20:33:05.8017923Z x1 = x1.contiguous() 2025-05-07T20:33:05.8018320Z 2025-05-07T20:33:05.8018634Z if scale_ub is not None: 2025-05-07T20:33:05.8019078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.8019637Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.8020164Z ) 2025-05-07T20:33:05.8020465Z else: 2025-05-07T20:33:05.8020798Z scale_ub_tensor = None 2025-05-07T20:33:05.8021203Z 2025-05-07T20:33:05.8021579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.8022091Z op = silu_mul_quant 2025-05-07T20:33:05.8022497Z if compiled: 2025-05-07T20:33:05.8023046Z op = torch.compile(op) 2025-05-07T20:33:05.8023626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.8024084Z 2025-05-07T20:33:05.8024390Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.8024663Z 2025-05-07T20:33:05.8024820Z moe/activation_test.py:117: 2025-05-07T20:33:05.8025308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.8025998Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.8026451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.8027410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.8028376Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.8029513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.8030926Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.8031857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.8033102Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.8034250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.8035169Z kernel = self.compile( 2025-05-07T20:33:05.8036083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.8037244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.8037905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.8038294Z 2025-05-07T20:33:05.8038630Z self = 2025-05-07T20:33:05.8040445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.8042881Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b80f7790>} 2025-05-07T20:33:05.8045226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.8047010Z context = 2025-05-07T20:33:05.8047600Z 2025-05-07T20:33:05.8047878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.8048774Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.8049579Z module_map=module_map) 2025-05-07T20:33:05.8050170Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.8050742Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.8051155Z E ^ 2025-05-07T20:33:05.8051935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.8052763Z 2025-05-07T20:33:05.8053489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.8054400Z 2025-05-07T20:33:05.8054574Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.8055272Z self=, 2025-05-07T20:33:05.8055945Z T=128, 2025-05-07T20:33:05.8056244Z D=5120, 2025-05-07T20:33:05.8056554Z scale_ub=1200.0, 2025-05-07T20:33:05.8056897Z contiguous=False, 2025-05-07T20:33:05.8057252Z compiled=True, 2025-05-07T20:33:05.8057577Z ) 2025-05-07T20:33:05.9381438Z self = 2025-05-07T20:33:05.9383140Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.9383467Z 2025-05-07T20:33:05.9383548Z @given( 2025-05-07T20:33:05.9383784Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.9384107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.9384423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.9384871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.9385206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.9385503Z ) 2025-05-07T20:33:05.9385871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.9386340Z def test_silu_mul_quant( 2025-05-07T20:33:05.9386576Z self, 2025-05-07T20:33:05.9386771Z T: int, 2025-05-07T20:33:05.9386969Z D: int, 2025-05-07T20:33:05.9387183Z scale_ub: Optional[float], 2025-05-07T20:33:05.9387457Z contiguous: bool, 2025-05-07T20:33:05.9387702Z compiled: bool, 2025-05-07T20:33:05.9387925Z ) -> None: 2025-05-07T20:33:05.9388142Z torch.manual_seed(2025) 2025-05-07T20:33:05.9388386Z 2025-05-07T20:33:05.9388656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.9389020Z 2025-05-07T20:33:05.9389213Z x_sign = torch.sign(x) 2025-05-07T20:33:05.9389507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.9389939Z x = x_sign * x_clamp 2025-05-07T20:33:05.9390190Z x0 = x[:, :D] 2025-05-07T20:33:05.9390407Z x1 = x[:, D:] 2025-05-07T20:33:05.9390615Z 2025-05-07T20:33:05.9390807Z if contiguous: 2025-05-07T20:33:05.9391034Z x0 = x0.contiguous() 2025-05-07T20:33:05.9391300Z x1 = x1.contiguous() 2025-05-07T20:33:05.9391547Z 2025-05-07T20:33:05.9391743Z if scale_ub is not None: 2025-05-07T20:33:05.9392017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.9392363Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.9392681Z ) 2025-05-07T20:33:05.9392865Z else: 2025-05-07T20:33:05.9393074Z scale_ub_tensor = None 2025-05-07T20:33:05.9393331Z 2025-05-07T20:33:05.9393553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.9393878Z op = silu_mul_quant 2025-05-07T20:33:05.9394132Z if compiled: 2025-05-07T20:33:05.9394372Z op = torch.compile(op) 2025-05-07T20:33:05.9394672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9395033Z 2025-05-07T20:33:05.9395222Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.9395395Z 2025-05-07T20:33:05.9395493Z moe/activation_test.py:117: 2025-05-07T20:33:05.9395795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9396140Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.9396424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9397020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.9397617Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.9398315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.9399058Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.9399620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.9400353Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.9401056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.9401624Z kernel = self.compile( 2025-05-07T20:33:05.9402288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.9403060Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.9403473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9403719Z 2025-05-07T20:33:05.9403930Z self = 2025-05-07T20:33:05.9405102Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.9406737Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b80690d0>} 2025-05-07T20:33:05.9408207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.9409318Z context = 2025-05-07T20:33:05.9409622Z 2025-05-07T20:33:05.9409800Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.9410348Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.9410838Z module_map=module_map) 2025-05-07T20:33:05.9411217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.9411584Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.9411846Z E ^ 2025-05-07T20:33:05.9412339Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:05.9413284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:05.9413947Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> fails with the identical CompilationError and traceback (ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5'))
2025-05-07T20:33:06.2195436Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> identical CompilationError
2025-05-07T20:33:06.2237650Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> identical CompilationError
2025-05-07T20:33:06.2269740Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> identical CompilationError
2025-05-07T20:33:06.3470700Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> identical CompilationError
2025-05-07T20:33:06.5233768Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> identical CompilationError
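[Note: the error message itself pins down the cause. Triton only lowers the fp8e4nv dtype (FP8 E4M3) on NVIDIA GPUs of compute capability 8.9 and newer (Ada/Hopper); on older architectures only fp8e4b15 and fp8e5 are available, which is exactly the pair listed in the ValueError above. Below is a minimal sketch of one way to gate such a test on hardware support; the helper, class name, and skip message are illustrative assumptions, not part of FBGEMM's test suite.]

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) lowering in Triton requires an sm_89+ NVIDIA GPU;
        # torch.cuda.get_device_capability() returns (major, minor).
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    @unittest.skipIf(
        not device_supports_fp8e4nv(),
        "fp8e4nv (FP8 E4M3) requires compute capability >= 8.9",
    )
    class ActivationFP8Test(unittest.TestCase):  # hypothetical class name
        ...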
2025-05-07T20:33:06.5277463Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 320.00 MiB; GPU 0 has 22.07 GiB total, 140.44 MiB free, 21.92 GiB in use. The allocator suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True if reserved-but-unallocated memory is large (see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:06.5291979Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 112.00 MiB; 28.44 MiB free
2025-05-07T20:33:06.5305935Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): tried to allocate 448.00 MiB; 140.44 MiB free
2025-05-07T20:33:06.6363547Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 56.00 MiB; 28.44 MiB free
2025-05-07T20:33:06.6378524Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB; 28.44 MiB free
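[Note: the OutOfMemoryError failures interleaved here are consistent with many Hypothesis examples running in one process while device memory stays nearly full (the log shows ~21.9-22.0 GiB of 22.07 GiB in use). Each example allocates a [T, 2*D] bfloat16 input plus same-sized temporaries; for T=16384, D=7168 that is 16384 * 14336 * 2 bytes = 448 MiB per tensor, matching the failed allocation sizes above. A sketch of one possible mitigation follows; the helper name and tearDown placement are assumptions, and the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True setting quoted by the allocator addresses fragmentation rather than total usage.]

    import gc

    import torch

    def free_cuda_memory() -> None:
        # Drop dead Python references first, then release the caching
        # allocator's unused blocks so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()

    # e.g. in the test class (hypothetical placement):
    #     def tearDown(self) -> None:
    #         free_cuda_memory()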
2025-05-07T20:33:06.6392270Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> identical CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:33:06.9743615Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> identical CompilationError
2025-05-07T20:33:06.9775735Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> identical CompilationError
2025-05-07T20:33:07.0712802Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): tried to allocate 56.00 MiB; GPU 0 has 26.44 MiB free, 22.04 GiB in use
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0725238Z 2025-05-07T20:33:07.0725358Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0725587Z 2025-05-07T20:33:07.0725686Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0726113Z self=, 2025-05-07T20:33:07.0726524Z T=1, 2025-05-07T20:33:07.0726706Z D=5120, 2025-05-07T20:33:07.0726894Z scale_ub=1200.0, 2025-05-07T20:33:07.0727156Z contiguous=True, 2025-05-07T20:33:07.0727382Z compiled=False, 2025-05-07T20:33:07.0727627Z ) 2025-05-07T20:33:07.1203388Z self = 2025-05-07T20:33:07.1204144Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.1204528Z 2025-05-07T20:33:07.1204644Z @given( 2025-05-07T20:33:07.1204873Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1205412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1205736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1206083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1206427Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1206728Z ) 2025-05-07T20:33:07.1207085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1207557Z def test_silu_mul_quant( 2025-05-07T20:33:07.1207803Z self, 2025-05-07T20:33:07.1208000Z T: int, 2025-05-07T20:33:07.1208199Z D: int, 2025-05-07T20:33:07.1208428Z scale_ub: Optional[float], 2025-05-07T20:33:07.1208701Z contiguous: bool, 2025-05-07T20:33:07.1208945Z compiled: bool, 2025-05-07T20:33:07.1209174Z ) -> None: 2025-05-07T20:33:07.1209391Z torch.manual_seed(2025) 2025-05-07T20:33:07.1209631Z 2025-05-07T20:33:07.1209907Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1210270Z 2025-05-07T20:33:07.1210463Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1210762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1211090Z x = x_sign * x_clamp 2025-05-07T20:33:07.1211332Z x0 = x[:, :D] 2025-05-07T20:33:07.1211551Z x1 = x[:, D:] 2025-05-07T20:33:07.1211763Z 2025-05-07T20:33:07.1211947Z if contiguous: 2025-05-07T20:33:07.1212180Z x0 = x0.contiguous() 2025-05-07T20:33:07.1212470Z x1 = x1.contiguous() 2025-05-07T20:33:07.1212744Z 2025-05-07T20:33:07.1212943Z if scale_ub is not None: 2025-05-07T20:33:07.1213225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.1213567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.1213898Z ) 2025-05-07T20:33:07.1214094Z else: 2025-05-07T20:33:07.1214307Z scale_ub_tensor = None 2025-05-07T20:33:07.1214569Z 2025-05-07T20:33:07.1214800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.1215125Z op = silu_mul_quant 2025-05-07T20:33:07.1215374Z if compiled: 2025-05-07T20:33:07.1215726Z op = torch.compile(op) 2025-05-07T20:33:07.1216032Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1216311Z 2025-05-07T20:33:07.1216505Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.1216674Z 2025-05-07T20:33:07.1216780Z moe/activation_test.py:117: 2025-05-07T20:33:07.1217081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1217433Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.1217726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1218474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.1219214Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.1219784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.1220518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.1221221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.1221793Z kernel = self.compile( 2025-05-07T20:33:07.1222367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.1223184Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.1223653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1223902Z 2025-05-07T20:33:07.1224113Z self = 2025-05-07T20:33:07.1225281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.1226867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b7ca3040>} 2025-05-07T20:33:07.1228401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.1229981Z context = 2025-05-07T20:33:07.1230298Z 2025-05-07T20:33:07.1230467Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.1231021Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.1231506Z module_map=module_map) 2025-05-07T20:33:07.1231888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.1232247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.1232513Z E ^ 2025-05-07T20:33:07.1233006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.1233499Z 2025-05-07T20:33:07.1233947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.1234502Z 2025-05-07T20:33:07.1234613Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1235039Z self=, 2025-05-07T20:33:07.1235461Z T=2048, 2025-05-07T20:33:07.1235648Z D=5120, 2025-05-07T20:33:07.1235838Z scale_ub=None, 2025-05-07T20:33:07.1236048Z contiguous=True, 2025-05-07T20:33:07.1236279Z compiled=False, 2025-05-07T20:33:07.1236491Z ) 2025-05-07T20:33:07.1236817Z self = 2025-05-07T20:33:07.1237341Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.1237629Z 2025-05-07T20:33:07.1237781Z @given( 2025-05-07T20:33:07.1238006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1238327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1238644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1238985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1239329Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1239629Z ) 2025-05-07T20:33:07.1239986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1240446Z def test_silu_mul_quant( 2025-05-07T20:33:07.1240696Z self, 2025-05-07T20:33:07.1240881Z T: int, 2025-05-07T20:33:07.1241077Z D: int, 2025-05-07T20:33:07.1241296Z scale_ub: Optional[float], 2025-05-07T20:33:07.1241563Z contiguous: bool, 2025-05-07T20:33:07.1241803Z compiled: bool, 2025-05-07T20:33:07.1242027Z ) -> None: 2025-05-07T20:33:07.1242240Z torch.manual_seed(2025) 2025-05-07T20:33:07.1242512Z 2025-05-07T20:33:07.1242813Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1243162Z 2025-05-07T20:33:07.1243353Z > x_sign = torch.sign(x) 2025-05-07T20:33:07.1245532Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.1247669Z 2025-05-07T20:33:07.1247785Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:07.1248003Z 2025-05-07T20:33:07.1248114Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1248535Z self=, 2025-05-07T20:33:07.1248960Z T=16384, 2025-05-07T20:33:07.1249156Z D=5120, 2025-05-07T20:33:07.1249339Z scale_ub=None, 2025-05-07T20:33:07.1249552Z contiguous=True, 2025-05-07T20:33:07.1249776Z compiled=False, 2025-05-07T20:33:07.1249968Z ) 2025-05-07T20:33:07.1250295Z self = 2025-05-07T20:33:07.1250814Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.1251105Z 2025-05-07T20:33:07.1251188Z @given( 2025-05-07T20:33:07.1251408Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1251732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1252043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1252378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1252718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1253006Z ) 2025-05-07T20:33:07.1253363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1253828Z def test_silu_mul_quant( 2025-05-07T20:33:07.1254066Z self, 2025-05-07T20:33:07.1254255Z T: int, 2025-05-07T20:33:07.1254446Z D: int, 2025-05-07T20:33:07.1254663Z scale_ub: Optional[float], 2025-05-07T20:33:07.1254941Z contiguous: bool, 2025-05-07T20:33:07.1255175Z compiled: bool, 2025-05-07T20:33:07.1255397Z ) -> None: 2025-05-07T20:33:07.1255608Z torch.manual_seed(2025) 2025-05-07T20:33:07.1255849Z 2025-05-07T20:33:07.1256121Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1258417Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.1260476Z 2025-05-07T20:33:07.1260603Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.1260820Z 2025-05-07T20:33:07.1260931Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.1261348Z self=, 2025-05-07T20:33:07.1261769Z T=4096, 2025-05-07T20:33:07.1261950Z D=5120, 2025-05-07T20:33:07.1262136Z scale_ub=None, 2025-05-07T20:33:07.1262351Z contiguous=True, 2025-05-07T20:33:07.1262575Z compiled=False, 2025-05-07T20:33:07.1262772Z ) 2025-05-07T20:33:07.2293498Z self = 2025-05-07T20:33:07.2294246Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.2294635Z 2025-05-07T20:33:07.2294745Z @given( 2025-05-07T20:33:07.2295050Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2295703Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2296110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2296628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2297030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2297321Z ) 2025-05-07T20:33:07.2297681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2298142Z def test_silu_mul_quant( 2025-05-07T20:33:07.2298463Z self, 2025-05-07T20:33:07.2298652Z T: int, 2025-05-07T20:33:07.2298840Z D: int, 2025-05-07T20:33:07.2299059Z scale_ub: Optional[float], 2025-05-07T20:33:07.2299333Z contiguous: bool, 2025-05-07T20:33:07.2299567Z compiled: bool, 2025-05-07T20:33:07.2299791Z ) -> None: 2025-05-07T20:33:07.2300008Z torch.manual_seed(2025) 2025-05-07T20:33:07.2300246Z 2025-05-07T20:33:07.2300518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2302757Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2304825Z 2025-05-07T20:33:07.2304942Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2305158Z 2025-05-07T20:33:07.2305266Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2305682Z self=, 2025-05-07T20:33:07.2306101Z T=2048, 2025-05-07T20:33:07.2306284Z D=5120, 2025-05-07T20:33:07.2306467Z scale_ub=None, 2025-05-07T20:33:07.2306682Z contiguous=False, 2025-05-07T20:33:07.2306908Z compiled=False, 2025-05-07T20:33:07.2307107Z ) 2025-05-07T20:33:07.2307433Z self = 2025-05-07T20:33:07.2307950Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.2308237Z 2025-05-07T20:33:07.2308318Z @given( 2025-05-07T20:33:07.2308540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2308864Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2309178Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2309592Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2310069Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2310368Z ) 2025-05-07T20:33:07.2310723Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2311190Z def test_silu_mul_quant( 2025-05-07T20:33:07.2311443Z self, 2025-05-07T20:33:07.2311635Z T: int, 2025-05-07T20:33:07.2311831Z D: int, 2025-05-07T20:33:07.2312051Z scale_ub: Optional[float], 2025-05-07T20:33:07.2312325Z contiguous: bool, 2025-05-07T20:33:07.2312562Z compiled: bool, 2025-05-07T20:33:07.2312788Z ) -> None: 2025-05-07T20:33:07.2313003Z torch.manual_seed(2025) 2025-05-07T20:33:07.2313242Z 2025-05-07T20:33:07.2313518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2315808Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2317883Z 2025-05-07T20:33:07.2318003Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2318220Z 2025-05-07T20:33:07.2318327Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2318745Z self=, 2025-05-07T20:33:07.2319208Z T=4096, 2025-05-07T20:33:07.2319390Z D=7168, 2025-05-07T20:33:07.2319572Z scale_ub=None, 2025-05-07T20:33:07.2319785Z contiguous=True, 2025-05-07T20:33:07.2320006Z compiled=True, 2025-05-07T20:33:07.2320202Z ) 2025-05-07T20:33:07.2320530Z self = 2025-05-07T20:33:07.2321044Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.2321324Z 2025-05-07T20:33:07.2321399Z @given( 2025-05-07T20:33:07.2321626Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2321944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2322262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2322596Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2322934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2323227Z ) 2025-05-07T20:33:07.2323577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2324041Z def test_silu_mul_quant( 2025-05-07T20:33:07.2324284Z self, 2025-05-07T20:33:07.2324470Z T: int, 2025-05-07T20:33:07.2324667Z D: int, 2025-05-07T20:33:07.2324891Z scale_ub: Optional[float], 2025-05-07T20:33:07.2325157Z contiguous: bool, 2025-05-07T20:33:07.2325398Z compiled: bool, 2025-05-07T20:33:07.2325618Z ) -> None: 2025-05-07T20:33:07.2325826Z torch.manual_seed(2025) 2025-05-07T20:33:07.2326074Z 2025-05-07T20:33:07.2326350Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2328597Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2330699Z 2025-05-07T20:33:07.2330826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2331048Z 2025-05-07T20:33:07.2331152Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2331586Z self=, 2025-05-07T20:33:07.2332013Z T=2048, 2025-05-07T20:33:07.2332197Z D=5120, 2025-05-07T20:33:07.2332387Z scale_ub=1200.0, 2025-05-07T20:33:07.2332613Z contiguous=False, 2025-05-07T20:33:07.2332835Z compiled=False, 2025-05-07T20:33:07.2333043Z ) 2025-05-07T20:33:07.2333371Z self = 2025-05-07T20:33:07.2333894Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.2334217Z 2025-05-07T20:33:07.2334304Z @given( 2025-05-07T20:33:07.2334525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2334847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2335167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2335501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2335839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2336132Z ) 2025-05-07T20:33:07.2336482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2336993Z def test_silu_mul_quant( 2025-05-07T20:33:07.2337304Z self, 2025-05-07T20:33:07.2337495Z T: int, 2025-05-07T20:33:07.2337682Z D: int, 2025-05-07T20:33:07.2337902Z scale_ub: Optional[float], 2025-05-07T20:33:07.2338173Z contiguous: bool, 2025-05-07T20:33:07.2338408Z compiled: bool, 2025-05-07T20:33:07.2338628Z ) -> None: 2025-05-07T20:33:07.2338841Z torch.manual_seed(2025) 2025-05-07T20:33:07.2339125Z 2025-05-07T20:33:07.2339398Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2341627Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2343727Z 2025-05-07T20:33:07.2343848Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2344067Z 2025-05-07T20:33:07.2344177Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2344594Z self=, 2025-05-07T20:33:07.2345016Z T=4096, 2025-05-07T20:33:07.2345200Z D=7168, 2025-05-07T20:33:07.2345383Z scale_ub=1200.0, 2025-05-07T20:33:07.2345603Z contiguous=True, 2025-05-07T20:33:07.2345824Z compiled=False, 2025-05-07T20:33:07.2346027Z ) 2025-05-07T20:33:07.2346360Z self = 2025-05-07T20:33:07.2346876Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.2347165Z 2025-05-07T20:33:07.2347245Z @given( 2025-05-07T20:33:07.2347470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2355169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2355538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2355903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2356253Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2356557Z ) 2025-05-07T20:33:07.2356932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2357413Z def test_silu_mul_quant( 2025-05-07T20:33:07.2357679Z self, 2025-05-07T20:33:07.2357963Z T: int, 2025-05-07T20:33:07.2358176Z D: int, 2025-05-07T20:33:07.2358405Z scale_ub: Optional[float], 2025-05-07T20:33:07.2358686Z contiguous: bool, 2025-05-07T20:33:07.2358939Z compiled: bool, 2025-05-07T20:33:07.2359174Z ) -> None: 2025-05-07T20:33:07.2359397Z torch.manual_seed(2025) 2025-05-07T20:33:07.2359662Z 2025-05-07T20:33:07.2359955Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2362236Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2364308Z 2025-05-07T20:33:07.2364437Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2364663Z 2025-05-07T20:33:07.2364769Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2365252Z self=, 2025-05-07T20:33:07.2365680Z T=16384, 2025-05-07T20:33:07.2365914Z D=7168, 2025-05-07T20:33:07.2366115Z scale_ub=None, 2025-05-07T20:33:07.2366336Z contiguous=False, 2025-05-07T20:33:07.2366564Z compiled=True, 2025-05-07T20:33:07.2366772Z ) 2025-05-07T20:33:07.3653523Z self = 2025-05-07T20:33:07.3654310Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.3655033Z 2025-05-07T20:33:07.3655119Z @given( 2025-05-07T20:33:07.3655363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3655704Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3656032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3656368Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3656711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3657009Z ) 2025-05-07T20:33:07.3657373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3657851Z def test_silu_mul_quant( 2025-05-07T20:33:07.3658107Z self, 2025-05-07T20:33:07.3658314Z T: int, 2025-05-07T20:33:07.3658507Z D: int, 2025-05-07T20:33:07.3658732Z scale_ub: Optional[float], 2025-05-07T20:33:07.3659015Z contiguous: bool, 2025-05-07T20:33:07.3659254Z compiled: bool, 2025-05-07T20:33:07.3659490Z ) -> None: 2025-05-07T20:33:07.3659708Z torch.manual_seed(2025) 2025-05-07T20:33:07.3659947Z 2025-05-07T20:33:07.3660227Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3662492Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.3664583Z 2025-05-07T20:33:07.3664700Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.3664918Z 2025-05-07T20:33:07.3665029Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3665450Z self=, 2025-05-07T20:33:07.3665868Z T=4096, 2025-05-07T20:33:07.3666055Z D=7168, 2025-05-07T20:33:07.3666330Z scale_ub=None, 2025-05-07T20:33:07.3666548Z contiguous=True, 2025-05-07T20:33:07.3666769Z compiled=False, 2025-05-07T20:33:07.3666976Z ) 2025-05-07T20:33:07.3667302Z self = 2025-05-07T20:33:07.3667823Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.3668109Z 2025-05-07T20:33:07.3668196Z @given( 2025-05-07T20:33:07.3668418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3668738Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3669048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3669378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3669713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3670157Z ) 2025-05-07T20:33:07.3670512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3670980Z def test_silu_mul_quant( 2025-05-07T20:33:07.3671247Z self, 2025-05-07T20:33:07.3671446Z T: int, 2025-05-07T20:33:07.3671632Z D: int, 2025-05-07T20:33:07.3671849Z scale_ub: Optional[float], 2025-05-07T20:33:07.3672125Z contiguous: bool, 2025-05-07T20:33:07.3672370Z compiled: bool, 2025-05-07T20:33:07.3672723Z ) -> None: 2025-05-07T20:33:07.3672945Z torch.manual_seed(2025) 2025-05-07T20:33:07.3673267Z 2025-05-07T20:33:07.3673539Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3675777Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.3677873Z 2025-05-07T20:33:07.3677992Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.3678214Z 2025-05-07T20:33:07.3678322Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3678753Z self=, 2025-05-07T20:33:07.3679180Z T=16384, 2025-05-07T20:33:07.3679375Z D=7168, 2025-05-07T20:33:07.3679568Z scale_ub=None, 2025-05-07T20:33:07.3679773Z contiguous=True, 2025-05-07T20:33:07.3679997Z compiled=False, 2025-05-07T20:33:07.3680200Z ) 2025-05-07T20:33:07.3680523Z self = 2025-05-07T20:33:07.3681051Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.3681345Z 2025-05-07T20:33:07.3681430Z @given( 2025-05-07T20:33:07.3681652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3681969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3682289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3682654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3683288Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3683581Z ) 2025-05-07T20:33:07.3683944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3684410Z def test_silu_mul_quant( 2025-05-07T20:33:07.3684658Z self, 2025-05-07T20:33:07.3684868Z T: int, 2025-05-07T20:33:07.3685063Z D: int, 2025-05-07T20:33:07.3685282Z scale_ub: Optional[float], 2025-05-07T20:33:07.3685560Z contiguous: bool, 2025-05-07T20:33:07.3685796Z compiled: bool, 2025-05-07T20:33:07.3686019Z ) -> None: 2025-05-07T20:33:07.3686240Z torch.manual_seed(2025) 2025-05-07T20:33:07.3686477Z 2025-05-07T20:33:07.3686823Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3689067Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.3691127Z 2025-05-07T20:33:07.3691243Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.3691462Z 2025-05-07T20:33:07.3691570Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3691990Z self=, 2025-05-07T20:33:07.3692416Z T=16384, 2025-05-07T20:33:07.3692619Z D=7168, 2025-05-07T20:33:07.3692803Z scale_ub=1200.0, 2025-05-07T20:33:07.3693027Z contiguous=True, 2025-05-07T20:33:07.3693247Z compiled=False, 2025-05-07T20:33:07.3693445Z ) 2025-05-07T20:33:07.3693772Z self = 2025-05-07T20:33:07.3694361Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.3694711Z 2025-05-07T20:33:07.3694792Z @given( 2025-05-07T20:33:07.3695014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3695339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3695653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3695988Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3696389Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3696687Z ) 2025-05-07T20:33:07.3697047Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3697516Z def test_silu_mul_quant( 2025-05-07T20:33:07.3697765Z self, 2025-05-07T20:33:07.3697953Z T: int, 2025-05-07T20:33:07.3698152Z D: int, 2025-05-07T20:33:07.3698377Z scale_ub: Optional[float], 2025-05-07T20:33:07.3698656Z contiguous: bool, 2025-05-07T20:33:07.3698892Z compiled: bool, 2025-05-07T20:33:07.3699121Z ) -> None: 2025-05-07T20:33:07.3699335Z torch.manual_seed(2025) 2025-05-07T20:33:07.3699572Z 2025-05-07T20:33:07.3699843Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3702088Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.3704199Z 2025-05-07T20:33:07.3704323Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.3704544Z 2025-05-07T20:33:07.3704654Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3705075Z self=, 2025-05-07T20:33:07.3705493Z T=128, 2025-05-07T20:33:07.3705680Z D=5120, 2025-05-07T20:33:07.3705866Z scale_ub=1200.0, 2025-05-07T20:33:07.3706089Z contiguous=False, 2025-05-07T20:33:07.3706314Z compiled=False, 2025-05-07T20:33:07.3706518Z ) 2025-05-07T20:33:07.5328610Z self = 2025-05-07T20:33:07.5329343Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5329970Z 2025-05-07T20:33:07.5330066Z @given( 2025-05-07T20:33:07.5330310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5330634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5330955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5331312Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5331644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5331947Z ) 2025-05-07T20:33:07.5332305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5332771Z def test_silu_mul_quant( 2025-05-07T20:33:07.5333012Z self, 2025-05-07T20:33:07.5333206Z T: int, 2025-05-07T20:33:07.5333404Z D: int, 2025-05-07T20:33:07.5333622Z scale_ub: Optional[float], 2025-05-07T20:33:07.5333899Z contiguous: bool, 2025-05-07T20:33:07.5334142Z compiled: bool, 2025-05-07T20:33:07.5334362Z ) -> None: 2025-05-07T20:33:07.5334586Z torch.manual_seed(2025) 2025-05-07T20:33:07.5334837Z 2025-05-07T20:33:07.5335103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5335459Z 2025-05-07T20:33:07.5335654Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5336089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5336432Z x = x_sign * x_clamp 2025-05-07T20:33:07.5336745Z x0 = x[:, :D] 2025-05-07T20:33:07.5336965Z x1 = x[:, D:] 2025-05-07T20:33:07.5337182Z 2025-05-07T20:33:07.5337364Z if contiguous: 2025-05-07T20:33:07.5337601Z x0 = x0.contiguous() 2025-05-07T20:33:07.5337872Z x1 = x1.contiguous() 2025-05-07T20:33:07.5338123Z 2025-05-07T20:33:07.5338316Z if scale_ub is not None: 2025-05-07T20:33:07.5338684Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5339031Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5339345Z ) 2025-05-07T20:33:07.5339542Z else: 2025-05-07T20:33:07.5339753Z scale_ub_tensor = None 2025-05-07T20:33:07.5340007Z 2025-05-07T20:33:07.5340242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5340567Z op = silu_mul_quant 2025-05-07T20:33:07.5340816Z if compiled: 2025-05-07T20:33:07.5341067Z op = torch.compile(op) 2025-05-07T20:33:07.5341373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5341652Z 2025-05-07T20:33:07.5341841Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5342008Z 2025-05-07T20:33:07.5342110Z moe/activation_test.py:117: 2025-05-07T20:33:07.5342416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5342812Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5343099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5343844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5344581Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5345150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5345882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5346592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5347155Z kernel = self.compile( 2025-05-07T20:33:07.5347731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5348431Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5348839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5349087Z 2025-05-07T20:33:07.5349347Z self = 2025-05-07T20:33:07.5350776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5352307Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b79c5ca0>} 2025-05-07T20:33:07.5353782Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5354884Z context = 2025-05-07T20:33:07.5355199Z 2025-05-07T20:33:07.5355369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5355926Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5356420Z module_map=module_map) 2025-05-07T20:33:07.5356788Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5357151Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5357421Z E ^ 2025-05-07T20:33:07.5357957Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5358496Z 2025-05-07T20:33:07.5358944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5359512Z 2025-05-07T20:33:07.5359616Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5360082Z self=, 2025-05-07T20:33:07.5360494Z T=2048, 2025-05-07T20:33:07.5360686Z D=7168, 2025-05-07T20:33:07.5360881Z scale_ub=None, 2025-05-07T20:33:07.5361097Z contiguous=False, 2025-05-07T20:33:07.5361325Z compiled=False, 2025-05-07T20:33:07.5361536Z ) 2025-05-07T20:33:07.5361859Z self = 2025-05-07T20:33:07.5362378Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5362669Z 2025-05-07T20:33:07.5362754Z @given( 2025-05-07T20:33:07.5362990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5363308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5363629Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5363970Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5364305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5364602Z ) 2025-05-07T20:33:07.5364966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5365421Z def test_silu_mul_quant( 2025-05-07T20:33:07.5365670Z self, 2025-05-07T20:33:07.5365863Z T: int, 2025-05-07T20:33:07.5366053Z D: int, 2025-05-07T20:33:07.5366268Z scale_ub: Optional[float], 2025-05-07T20:33:07.5366544Z contiguous: bool, 2025-05-07T20:33:07.5366779Z compiled: bool, 2025-05-07T20:33:07.5367001Z ) -> None: 2025-05-07T20:33:07.5367220Z torch.manual_seed(2025) 2025-05-07T20:33:07.5367468Z 2025-05-07T20:33:07.5367737Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5370033Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5372093Z 2025-05-07T20:33:07.5372211Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.5372433Z 2025-05-07T20:33:07.5372542Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5373018Z self=, 2025-05-07T20:33:07.5373442Z T=128, 2025-05-07T20:33:07.5373628Z D=7168, 2025-05-07T20:33:07.5373817Z scale_ub=1200.0, 2025-05-07T20:33:07.5374035Z contiguous=True, 2025-05-07T20:33:07.5374254Z compiled=True, 2025-05-07T20:33:07.5374457Z ) 2025-05-07T20:33:07.5825389Z self = 2025-05-07T20:33:07.5826197Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.5826580Z 2025-05-07T20:33:07.5826684Z @given( 2025-05-07T20:33:07.5826999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5827351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5827667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5828008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5828347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5828639Z ) 2025-05-07T20:33:07.5829188Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5829722Z def test_silu_mul_quant( 2025-05-07T20:33:07.5830093Z self, 2025-05-07T20:33:07.5830285Z T: int, 2025-05-07T20:33:07.5830479Z D: int, 2025-05-07T20:33:07.5830700Z scale_ub: Optional[float], 2025-05-07T20:33:07.5830970Z contiguous: bool, 2025-05-07T20:33:07.5831310Z compiled: bool, 2025-05-07T20:33:07.5831543Z ) -> None: 2025-05-07T20:33:07.5831763Z torch.manual_seed(2025) 2025-05-07T20:33:07.5832018Z 2025-05-07T20:33:07.5832305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5832660Z 2025-05-07T20:33:07.5832860Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5833161Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5833486Z x = x_sign * x_clamp 2025-05-07T20:33:07.5833738Z x0 = x[:, :D] 2025-05-07T20:33:07.5833965Z x1 = x[:, D:] 2025-05-07T20:33:07.5834173Z 2025-05-07T20:33:07.5834364Z if contiguous: 2025-05-07T20:33:07.5834604Z x0 = x0.contiguous() 2025-05-07T20:33:07.5834866Z x1 = x1.contiguous() 2025-05-07T20:33:07.5835119Z 2025-05-07T20:33:07.5835317Z if scale_ub is not None: 2025-05-07T20:33:07.5835596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5835939Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5836261Z ) 2025-05-07T20:33:07.5836455Z else: 2025-05-07T20:33:07.5836660Z scale_ub_tensor = None 2025-05-07T20:33:07.5836919Z 2025-05-07T20:33:07.5837152Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5837469Z op = silu_mul_quant 2025-05-07T20:33:07.5837722Z if compiled: 2025-05-07T20:33:07.5837969Z op = torch.compile(op) 2025-05-07T20:33:07.5838268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5838551Z 2025-05-07T20:33:07.5838742Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5838911Z 2025-05-07T20:33:07.5839007Z moe/activation_test.py:117: 2025-05-07T20:33:07.5839310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5839658Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5839947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5840532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5841131Z return fn(*args, **kwargs) 2025-05-07T20:33:07.5841929Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5842670Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5843236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5843975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5844686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5845248Z kernel = self.compile( 2025-05-07T20:33:07.5845820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5846519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5846927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5847176Z 2025-05-07T20:33:07.5847390Z self = 2025-05-07T20:33:07.5848607Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5850122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b793e0d0>} 2025-05-07T20:33:07.5851633Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5852780Z context = 2025-05-07T20:33:07.5853091Z 2025-05-07T20:33:07.5853262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5853812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5854308Z module_map=module_map) 2025-05-07T20:33:07.5854683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5855048Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5855315Z E ^ 2025-05-07T20:33:07.5855803Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5856295Z 2025-05-07T20:33:07.5856740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5857298Z 2025-05-07T20:33:07.5857403Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5857828Z self=, 2025-05-07T20:33:07.5858244Z T=128, 2025-05-07T20:33:07.5858434Z D=7168, 2025-05-07T20:33:07.5858629Z scale_ub=1200.0, 2025-05-07T20:33:07.5858840Z contiguous=True, 2025-05-07T20:33:07.5859063Z compiled=False, 2025-05-07T20:33:07.5859266Z ) 2025-05-07T20:33:07.5859585Z self = 2025-05-07T20:33:07.5860108Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5860397Z 2025-05-07T20:33:07.5860478Z @given( 2025-05-07T20:33:07.5860708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5861023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5861338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5861678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5862011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5862302Z ) 2025-05-07T20:33:07.5862661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5863196Z def test_silu_mul_quant( 2025-05-07T20:33:07.5863443Z self, 2025-05-07T20:33:07.5863637Z T: int, 2025-05-07T20:33:07.5863826Z D: int, 2025-05-07T20:33:07.5864043Z scale_ub: Optional[float], 2025-05-07T20:33:07.5864318Z contiguous: bool, 2025-05-07T20:33:07.5864551Z compiled: bool, 2025-05-07T20:33:07.5864775Z ) -> None: 2025-05-07T20:33:07.5873213Z torch.manual_seed(2025) 2025-05-07T20:33:07.5873504Z 2025-05-07T20:33:07.5873797Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5874164Z 2025-05-07T20:33:07.5874363Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5874658Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5876866Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5879040Z 2025-05-07T20:33:07.5879166Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.5879439Z 2025-05-07T20:33:07.5879554Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5879985Z self=, 2025-05-07T20:33:07.5880419Z T=128, 2025-05-07T20:33:07.5880615Z D=5120, 2025-05-07T20:33:07.5880815Z scale_ub=1200.0, 2025-05-07T20:33:07.5881083Z contiguous=True, 2025-05-07T20:33:07.5881311Z compiled=True, 2025-05-07T20:33:07.5881518Z ) 2025-05-07T20:33:07.5881840Z self = 2025-05-07T20:33:07.5882365Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.5882647Z 2025-05-07T20:33:07.5883098Z @given( 2025-05-07T20:33:07.5883336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5883662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5883984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5884315Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5884658Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5884946Z ) 2025-05-07T20:33:07.5885314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5885778Z def test_silu_mul_quant( 2025-05-07T20:33:07.5886030Z self, 2025-05-07T20:33:07.5886231Z T: int, 2025-05-07T20:33:07.5886427Z D: int, 2025-05-07T20:33:07.5886650Z scale_ub: Optional[float], 2025-05-07T20:33:07.5886932Z contiguous: bool, 2025-05-07T20:33:07.5887171Z compiled: bool, 2025-05-07T20:33:07.5887402Z ) -> None: 2025-05-07T20:33:07.5887617Z torch.manual_seed(2025) 2025-05-07T20:33:07.5887863Z 2025-05-07T20:33:07.5888134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5888492Z 2025-05-07T20:33:07.5888686Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5888975Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5891281Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5893379Z 2025-05-07T20:33:07.5893495Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.5893712Z 2025-05-07T20:33:07.5893820Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5894242Z self=, 2025-05-07T20:33:07.5894657Z T=128, 2025-05-07T20:33:07.5894842Z D=7168, 2025-05-07T20:33:07.5895032Z scale_ub=None, 2025-05-07T20:33:07.5895238Z contiguous=True, 2025-05-07T20:33:07.5895455Z compiled=True, 2025-05-07T20:33:07.5895652Z ) 2025-05-07T20:33:07.7945735Z self = 2025-05-07T20:33:07.7946481Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.7946887Z 2025-05-07T20:33:07.7946979Z @given( 2025-05-07T20:33:07.7947221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7947555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7947873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7948208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7948549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7948843Z ) 2025-05-07T20:33:07.7949422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7950127Z def test_silu_mul_quant( 2025-05-07T20:33:07.7950374Z self, 2025-05-07T20:33:07.7950564Z T: int, 2025-05-07T20:33:07.7950777Z D: int, 2025-05-07T20:33:07.7950999Z scale_ub: Optional[float], 2025-05-07T20:33:07.7951272Z contiguous: bool, 2025-05-07T20:33:07.7951516Z compiled: bool, 2025-05-07T20:33:07.7951839Z ) -> None: 2025-05-07T20:33:07.7952067Z torch.manual_seed(2025) 2025-05-07T20:33:07.7952316Z 2025-05-07T20:33:07.7952594Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7954847Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:07.7956913Z 
2025-05-07T20:33:07.7957041Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:07.7957260Z 
2025-05-07T20:33:07.7967260Z FAILED
2025-05-07T20:33:07.7967552Z 
2025-05-07T20:33:07.7967928Z =================================== FAILURES ===================================
2025-05-07T20:33:07.7968567Z _____________________ ActivationTests.test_silu_mul_quant ______________________
2025-05-07T20:33:07.7969260Z + Exception Group Traceback (most recent call last):
2025-05-07T20:33:07.7970115Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor
2025-05-07T20:33:07.7970902Z | yield
2025-05-07T20:33:07.7971505Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run
2025-05-07T20:33:07.7972251Z | self._callTestMethod(testMethod)
2025-05-07T20:33:07.7973088Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod
2025-05-07T20:33:07.7973854Z | method()
2025-05-07T20:33:07.7974769Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
2025-05-07T20:33:07.7975838Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:07.7976883Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test
2025-05-07T20:33:07.7977786Z | raise the_error_hypothesis_found
2025-05-07T20:33:07.7978478Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
2025-05-07T20:33:07.7979149Z +-+---------------- 1 ----------------
2025-05-07T20:33:07.7979560Z | Traceback (most recent call last):
2025-05-07T20:33:07.7980586Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
2025-05-07T20:33:07.7981704Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:07.7984942Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:07.7987854Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:33:07.7988566Z | self=,
2025-05-07T20:33:07.7989210Z | T=2048,
2025-05-07T20:33:07.7989529Z | D=5120, # or any other generated value
2025-05-07T20:33:07.7990217Z | scale_ub=None, # or any other generated value
2025-05-07T20:33:07.7990721Z | contiguous=True, # or any other generated value
2025-05-07T20:33:07.7991236Z | compiled=False, # or any other generated value
2025-05-07T20:33:07.7991749Z | )
2025-05-07T20:33:07.7991988Z |
2025-05-07T20:33:07.7992732Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
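
As hypothesis's note above says, each falsifying example can be replayed deterministically. A minimal sketch of how sub-failure 1 would be pinned locally, assuming a standalone variant of the test with the decorator stacked on top of the existing @given/@settings stack; the version string and payload are copied verbatim from the log, and the decorator is temporary, to be removed once the failure is fixed:

    from hypothesis import given, reproduce_failure, settings, strategies as st

    # Pin hypothesis to the falsifying example from sub-failure 1 above;
    # @reproduce_failure bypasses normal generation and the example database.
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
        ...  # body unchanged from the test shown in the log
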
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.7987854Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.7988566Z | self=, 2025-05-07T20:33:07.7989210Z | T=2048, 2025-05-07T20:33:07.7989529Z | D=5120, # or any other generated value 2025-05-07T20:33:07.7990217Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:07.7990721Z | contiguous=True, # or any other generated value 2025-05-07T20:33:07.7991236Z | compiled=False, # or any other generated value 2025-05-07T20:33:07.7991749Z | ) 2025-05-07T20:33:07.7991988Z | 2025-05-07T20:33:07.7992732Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:07.7993607Z +---------------- 2 ---------------- 2025-05-07T20:33:07.7994012Z | Traceback (most recent call last): 2025-05-07T20:33:07.7995041Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.7996159Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7999159Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8002060Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.8002685Z | self=, 2025-05-07T20:33:07.8003266Z | T=128, 2025-05-07T20:33:07.8003555Z | D=7168, 2025-05-07T20:33:07.8003816Z | scale_ub=None, 2025-05-07T20:33:07.8004064Z | contiguous=True, 2025-05-07T20:33:07.8004317Z | compiled=True, 2025-05-07T20:33:07.8004539Z | ) 2025-05-07T20:33:07.8004727Z | 2025-05-07T20:33:07.8005285Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.8005930Z +---------------- 3 ---------------- 2025-05-07T20:33:07.8006239Z | Traceback (most recent call last): 2025-05-07T20:33:07.8007085Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.8007931Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8010172Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
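Each sub-failure above comes with a Hypothesis replay blob. The workflow the message describes is to pin the decorator on top of the existing test, re-run just that falsifying example, and delete the decorator once the bug is fixed; a sketch against this test's own signature (the version string and blob are copied verbatim from failure 1 above, and the test body is elided):

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # temporary: replays T=2048, D=5120, ...
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body from moe/activation_test.py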
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.8012334Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.8012799Z | self=, 2025-05-07T20:33:07.8013229Z | T=128, 2025-05-07T20:33:07.8013448Z | D=5120, 2025-05-07T20:33:07.8013665Z | scale_ub=1200.0, 2025-05-07T20:33:07.8013920Z | contiguous=True, 2025-05-07T20:33:07.8014162Z | compiled=True, 2025-05-07T20:33:07.8014399Z | ) 2025-05-07T20:33:07.8014583Z | 2025-05-07T20:33:07.8015184Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.8015876Z +---------------- 4 ---------------- 2025-05-07T20:33:07.8016180Z | Traceback (most recent call last): 2025-05-07T20:33:07.8016939Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:07.8017865Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.8018913Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:07.8019946Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8021175Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:07.8022382Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8023300Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:07.8024376Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8025468Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:07.8026622Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8027833Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:07.8029040Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8030334Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:07.8031381Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8032349Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:07.8033205Z | fn() 2025-05-07T20:33:07.8034048Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:07.8035001Z | self.fn.run( 2025-05-07T20:33:07.8035926Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:07.8036790Z | kernel = self.compile( 2025-05-07T20:33:07.8037696Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:07.8038758Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8039823Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:07.8040965Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8041723Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8042234Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8042616Z | ^ 2025-05-07T20:33:07.8043308Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8044169Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.8044761Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:07.8045517Z | self=, 2025-05-07T20:33:07.8046166Z | T=1, # or any other generated value 2025-05-07T20:33:07.8046703Z | D=5120, # or any other generated value 2025-05-07T20:33:07.8047238Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:07.8047774Z | contiguous=True, # or any other generated value 2025-05-07T20:33:07.8048313Z | compiled=True, # or any other generated value 2025-05-07T20:33:07.8048762Z | ) 2025-05-07T20:33:07.8049020Z | 2025-05-07T20:33:07.8049766Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.8050736Z +------------------------------------ 2025-05-07T20:33:07.8051260Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:07.8051816Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8052422Z self=, 2025-05-07T20:33:07.8053067Z T=1, 2025-05-07T20:33:07.8053335Z D=5120, 2025-05-07T20:33:07.8053614Z scale_ub=None, 2025-05-07T20:33:07.8053911Z contiguous=True, 2025-05-07T20:33:07.8054220Z compiled=True, 2025-05-07T20:33:07.8054512Z ) 2025-05-07T20:33:07.8054961Z self = 2025-05-07T20:33:07.8055674Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.8056066Z 2025-05-07T20:33:07.8056178Z @given( 2025-05-07T20:33:07.8056509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8056960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8057408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8057896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8058373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8058797Z ) 2025-05-07T20:33:07.8059311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8059974Z def test_silu_mul_quant( 2025-05-07T20:33:07.8060324Z self, 2025-05-07T20:33:07.8060606Z T: int, 2025-05-07T20:33:07.8060893Z D: int, 2025-05-07T20:33:07.8061198Z scale_ub: Optional[float], 2025-05-07T20:33:07.8061591Z contiguous: bool, 2025-05-07T20:33:07.8061939Z compiled: bool, 2025-05-07T20:33:07.8062252Z ) -> None: 2025-05-07T20:33:07.8062558Z torch.manual_seed(2025) 2025-05-07T20:33:07.8062920Z 2025-05-07T20:33:07.8063307Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8063811Z 2025-05-07T20:33:07.8064091Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8064561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8065013Z x = x_sign * x_clamp 2025-05-07T20:33:07.8065358Z x0 = x[:, :D] 2025-05-07T20:33:07.8065652Z x1 = x[:, D:] 2025-05-07T20:33:07.8065954Z 2025-05-07T20:33:07.8066218Z if contiguous: 2025-05-07T20:33:07.8066544Z x0 = x0.contiguous() 
2025-05-07T20:33:07.8066918Z x1 = x1.contiguous() 2025-05-07T20:33:07.8067270Z 2025-05-07T20:33:07.8067537Z if scale_ub is not None: 2025-05-07T20:33:07.8067931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8068409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8068861Z ) 2025-05-07T20:33:07.8069132Z else: 2025-05-07T20:33:07.8069444Z scale_ub_tensor = None 2025-05-07T20:33:07.8069960Z 2025-05-07T20:33:07.8070291Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8070752Z op = silu_mul_quant 2025-05-07T20:33:07.8071121Z if compiled: 2025-05-07T20:33:07.8071477Z op = torch.compile(op) 2025-05-07T20:33:07.8071915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8072321Z 2025-05-07T20:33:07.8072591Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.8073060Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.8073494Z 2025-05-07T20:33:07.8073820Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8074354Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.8074787Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.8075244Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.8075761Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8076254Z 2025-05-07T20:33:07.8076540Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.8076832Z 2025-05-07T20:33:07.8076974Z moe/activation_test.py:126: 2025-05-07T20:33:07.8077418Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8077891Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.8078337Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8079500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.8080631Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8081424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8082434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8083732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.8084829Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8085966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.8087086Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8088194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.8089158Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8090056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.8090792Z fn() 2025-05-07T20:33:07.8091527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.8092410Z self.fn.run( 2025-05-07T20:33:07.8093070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8093999Z kernel = self.compile( 2025-05-07T20:33:07.8094778Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8095709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8096269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8096606Z 2025-05-07T20:33:07.8096898Z self = 2025-05-07T20:33:07.8098500Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8100642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bbfdc9d0>} 2025-05-07T20:33:07.8102704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8104249Z context = 2025-05-07T20:33:07.8104694Z 2025-05-07T20:33:07.8105016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8105871Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8106542Z module_map=module_map) 2025-05-07T20:33:07.8107054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8107551Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8108008Z E ^ 2025-05-07T20:33:07.8108692Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8109374Z 2025-05-07T20:33:07.8110105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8110872Z 2025-05-07T20:33:07.8111027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8111599Z self=, 2025-05-07T20:33:07.8112189Z T=2048, 2025-05-07T20:33:07.8112459Z D=5120, 2025-05-07T20:33:07.8112722Z scale_ub=1200.0, 2025-05-07T20:33:07.8113014Z contiguous=True, 2025-05-07T20:33:07.8113309Z compiled=False, 2025-05-07T20:33:07.8113584Z ) 2025-05-07T20:33:07.8114041Z self = 2025-05-07T20:33:07.8114771Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.8115145Z 2025-05-07T20:33:07.8115259Z @given( 2025-05-07T20:33:07.8115575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8116034Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8116481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8116956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8117438Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8117862Z ) 2025-05-07T20:33:07.8118371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8119052Z def test_silu_mul_quant( 2025-05-07T20:33:07.8119385Z self, 2025-05-07T20:33:07.8119646Z T: int, 2025-05-07T20:33:07.8119912Z D: int, 2025-05-07T20:33:07.8120216Z scale_ub: Optional[float], 2025-05-07T20:33:07.8120591Z contiguous: bool, 2025-05-07T20:33:07.8120923Z compiled: bool, 2025-05-07T20:33:07.8121252Z ) -> None: 2025-05-07T20:33:07.8121567Z torch.manual_seed(2025) 2025-05-07T20:33:07.8121920Z 2025-05-07T20:33:07.8122313Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8122929Z 2025-05-07T20:33:07.8123219Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8123636Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8124091Z x = x_sign * x_clamp 2025-05-07T20:33:07.8124436Z x0 = x[:, :D] 
2025-05-07T20:33:07.8124746Z x1 = x[:, D:] 2025-05-07T20:33:07.8125053Z 2025-05-07T20:33:07.8125312Z if contiguous: 2025-05-07T20:33:07.8125648Z x0 = x0.contiguous() 2025-05-07T20:33:07.8126024Z x1 = x1.contiguous() 2025-05-07T20:33:07.8126367Z 2025-05-07T20:33:07.8126634Z if scale_ub is not None: 2025-05-07T20:33:07.8127002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8127451Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8127877Z ) 2025-05-07T20:33:07.8128155Z else: 2025-05-07T20:33:07.8147495Z scale_ub_tensor = None 2025-05-07T20:33:07.8147846Z 2025-05-07T20:33:07.8148179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8148658Z op = silu_mul_quant 2025-05-07T20:33:07.8149022Z if compiled: 2025-05-07T20:33:07.8149390Z op = torch.compile(op) 2025-05-07T20:33:07.8149950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8150372Z 2025-05-07T20:33:07.8150764Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8151016Z 2025-05-07T20:33:07.8151249Z moe/activation_test.py:117: 2025-05-07T20:33:07.8151686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8152171Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8152588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8153636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8154692Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8155447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8156421Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8157369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8158123Z kernel = self.compile( 2025-05-07T20:33:07.8158907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8159887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8160473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8160816Z 2025-05-07T20:33:07.8161112Z self = 2025-05-07T20:33:07.8162808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8164885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1bb06cdc0>} 2025-05-07T20:33:07.8166931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8168477Z context = 2025-05-07T20:33:07.8168914Z 2025-05-07T20:33:07.8169152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8169878Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8170543Z module_map=module_map) 2025-05-07T20:33:07.8171103Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8171595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8171954Z E ^ 2025-05-07T20:33:07.8172606Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8173319Z 2025-05-07T20:33:07.8173920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8174670Z 2025-05-07T20:33:07.8174811Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8175386Z self=, 2025-05-07T20:33:07.8175944Z T=2048, 2025-05-07T20:33:07.8176201Z D=5120, 2025-05-07T20:33:07.8176475Z scale_ub=1200.0, 2025-05-07T20:33:07.8176780Z contiguous=True, 2025-05-07T20:33:07.8177076Z compiled=True, 2025-05-07T20:33:07.8177359Z ) 2025-05-07T20:33:07.8177816Z self = 2025-05-07T20:33:07.8178473Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.8178882Z 2025-05-07T20:33:07.8178990Z @given( 2025-05-07T20:33:07.8179303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8179823Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8180265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8180794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8181277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8181679Z ) 2025-05-07T20:33:07.8182088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8182557Z def test_silu_mul_quant( 2025-05-07T20:33:07.8183183Z self, 2025-05-07T20:33:07.8183385Z T: int, 2025-05-07T20:33:07.8183582Z D: int, 2025-05-07T20:33:07.8183794Z scale_ub: Optional[float], 2025-05-07T20:33:07.8184076Z contiguous: bool, 2025-05-07T20:33:07.8184321Z compiled: bool, 2025-05-07T20:33:07.8184546Z ) -> None: 2025-05-07T20:33:07.8184757Z torch.manual_seed(2025) 2025-05-07T20:33:07.8185005Z 2025-05-07T20:33:07.8185280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8185634Z 2025-05-07T20:33:07.8185823Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8186123Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8186437Z x = x_sign * x_clamp 2025-05-07T20:33:07.8186679Z x0 = x[:, :D] 2025-05-07T20:33:07.8186897Z x1 = x[:, D:] 2025-05-07T20:33:07.8187098Z 2025-05-07T20:33:07.8187281Z if contiguous: 2025-05-07T20:33:07.8187511Z x0 = x0.contiguous() 2025-05-07T20:33:07.8187768Z x1 = x1.contiguous() 2025-05-07T20:33:07.8188014Z 2025-05-07T20:33:07.8188208Z if scale_ub is not None: 2025-05-07T20:33:07.8188483Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8188826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8189147Z ) 2025-05-07T20:33:07.8189334Z else: 2025-05-07T20:33:07.8189551Z scale_ub_tensor = None 2025-05-07T20:33:07.8189939Z 2025-05-07T20:33:07.8190168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8190493Z op = silu_mul_quant 2025-05-07T20:33:07.8190746Z if compiled: 2025-05-07T20:33:07.8190994Z op = torch.compile(op) 2025-05-07T20:33:07.8191291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8191571Z 2025-05-07T20:33:07.8191762Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.8192042Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.8192345Z 2025-05-07T20:33:07.8192582Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8192918Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.8193376Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.8193702Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.8194072Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8194384Z 2025-05-07T20:33:07.8194581Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.8194789Z 2025-05-07T20:33:07.8194889Z moe/activation_test.py:126: 2025-05-07T20:33:07.8195188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8195534Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.8195868Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8196710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.8197522Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8198103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8198832Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8199564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.8200403Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8201264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.8202064Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8202837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.8203583Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8204220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.8204763Z fn() 2025-05-07T20:33:07.8205295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.8205914Z self.fn.run( 2025-05-07T20:33:07.8206401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8206960Z kernel = self.compile( 2025-05-07T20:33:07.8207525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8208216Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8208620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8208870Z 2025-05-07T20:33:07.8209083Z self = 2025-05-07T20:33:07.8210250Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8211755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1baa53550>} 2025-05-07T20:33:07.8213226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8214330Z context = 2025-05-07T20:33:07.8214634Z 2025-05-07T20:33:07.8214801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8215348Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8215881Z module_map=module_map) 2025-05-07T20:33:07.8216251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8216609Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8216877Z E ^ 2025-05-07T20:33:07.8217361Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8217857Z 2025-05-07T20:33:07.8218302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8219506Z 2025-05-07T20:33:07.8219609Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8220031Z self=, 2025-05-07T20:33:07.8220450Z T=16384, 2025-05-07T20:33:07.8220643Z D=7168, 2025-05-07T20:33:07.8220832Z scale_ub=1200.0, 2025-05-07T20:33:07.8221049Z contiguous=False, 2025-05-07T20:33:07.8221277Z compiled=False, 2025-05-07T20:33:07.8221479Z ) 2025-05-07T20:33:07.8221794Z self = 2025-05-07T20:33:07.8222315Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.8222635Z 2025-05-07T20:33:07.8222720Z @given( 2025-05-07T20:33:07.8223017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8223368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8223682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8224017Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8224344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8224635Z ) 2025-05-07T20:33:07.8224992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8225490Z def test_silu_mul_quant( 2025-05-07T20:33:07.8225729Z self, 2025-05-07T20:33:07.8225916Z T: int, 2025-05-07T20:33:07.8226110Z D: int, 2025-05-07T20:33:07.8226324Z scale_ub: Optional[float], 2025-05-07T20:33:07.8226596Z contiguous: bool, 2025-05-07T20:33:07.8226829Z compiled: bool, 2025-05-07T20:33:07.8227043Z ) -> None: 2025-05-07T20:33:07.8227256Z torch.manual_seed(2025) 2025-05-07T20:33:07.8227493Z 2025-05-07T20:33:07.8227758Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8228112Z 2025-05-07T20:33:07.8228297Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8228583Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8228900Z x = x_sign * x_clamp 2025-05-07T20:33:07.8229139Z x0 = x[:, :D] 2025-05-07T20:33:07.8229350Z x1 = x[:, D:] 2025-05-07T20:33:07.8229565Z 2025-05-07T20:33:07.8229750Z if contiguous: 2025-05-07T20:33:07.8230079Z x0 = x0.contiguous() 2025-05-07T20:33:07.8230342Z x1 = x1.contiguous() 2025-05-07T20:33:07.8230587Z 2025-05-07T20:33:07.8230775Z if scale_ub is not None: 2025-05-07T20:33:07.8231053Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8231395Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8231715Z ) 2025-05-07T20:33:07.8231900Z else: 2025-05-07T20:33:07.8232114Z scale_ub_tensor = None 2025-05-07T20:33:07.8232372Z 2025-05-07T20:33:07.8232599Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8232981Z op = silu_mul_quant 2025-05-07T20:33:07.8233230Z if compiled: 
2025-05-07T20:33:07.8233479Z op = torch.compile(op) 2025-05-07T20:33:07.8233784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8234060Z 2025-05-07T20:33:07.8234255Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8234423Z 2025-05-07T20:33:07.8234525Z moe/activation_test.py:117: 2025-05-07T20:33:07.8234871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8235219Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8235504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8236243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8236983Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8237551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8238289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8238990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8239562Z kernel = self.compile( 2025-05-07T20:33:07.8240131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8240828Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8241236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8241483Z 2025-05-07T20:33:07.8241693Z self = 2025-05-07T20:33:07.8242929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8244464Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1baa533a0>} 2025-05-07T20:33:07.8245931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8247073Z context = 2025-05-07T20:33:07.8247384Z 2025-05-07T20:33:07.8247553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8248104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8248590Z module_map=module_map) 2025-05-07T20:33:07.8248965Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8249324Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8249588Z E ^ 2025-05-07T20:33:07.8250070Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8250566Z 2025-05-07T20:33:07.8251018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8251570Z 2025-05-07T20:33:07.8251678Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8252106Z self=, 2025-05-07T20:33:07.8252524Z T=1, 2025-05-07T20:33:07.8252708Z D=7168, 2025-05-07T20:33:07.8252921Z scale_ub=None, 2025-05-07T20:33:07.8253153Z contiguous=True, 2025-05-07T20:33:07.8253374Z compiled=True, 2025-05-07T20:33:07.8253577Z ) 2025-05-07T20:33:07.8253893Z self = 2025-05-07T20:33:07.8254401Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.8254670Z 2025-05-07T20:33:07.8254756Z @given( 2025-05-07T20:33:07.8254978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8255304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8255621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8255958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8256340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8256639Z ) 2025-05-07T20:33:07.8257003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8257461Z def test_silu_mul_quant( 2025-05-07T20:33:07.8257710Z self, 2025-05-07T20:33:07.8257900Z T: int, 2025-05-07T20:33:07.8258093Z D: int, 2025-05-07T20:33:07.8258312Z scale_ub: Optional[float], 2025-05-07T20:33:07.8258592Z contiguous: bool, 2025-05-07T20:33:07.8258824Z compiled: bool, 2025-05-07T20:33:07.8259048Z ) -> None: 2025-05-07T20:33:07.8259264Z torch.manual_seed(2025) 2025-05-07T20:33:07.8259505Z 2025-05-07T20:33:07.8259782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8260142Z 2025-05-07T20:33:07.8260328Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8260628Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8260948Z x = x_sign * x_clamp 2025-05-07T20:33:07.8261196Z x0 = x[:, :D] 2025-05-07T20:33:07.8261409Z x1 = x[:, D:] 2025-05-07T20:33:07.8261618Z 2025-05-07T20:33:07.8261804Z if contiguous: 2025-05-07T20:33:07.8262031Z x0 = x0.contiguous() 2025-05-07T20:33:07.8262297Z x1 = x1.contiguous() 2025-05-07T20:33:07.8262542Z 2025-05-07T20:33:07.8262794Z if scale_ub is not None: 2025-05-07T20:33:07.8263095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8263499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8263812Z ) 2025-05-07T20:33:07.8264006Z else: 2025-05-07T20:33:07.8264213Z scale_ub_tensor = None 2025-05-07T20:33:07.8264463Z 2025-05-07T20:33:07.8264690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8265055Z op = silu_mul_quant 2025-05-07T20:33:07.8265303Z if compiled: 2025-05-07T20:33:07.8265554Z op = torch.compile(op) 2025-05-07T20:33:07.8265859Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8266137Z 2025-05-07T20:33:07.8266318Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.8266605Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.8266903Z 2025-05-07T20:33:07.8267134Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8267479Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.8267781Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.8268097Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.8268467Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8268789Z 2025-05-07T20:33:07.8268982Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.8269192Z 2025-05-07T20:33:07.8269293Z moe/activation_test.py:126: 2025-05-07T20:33:07.8269597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8270056Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.8270385Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8271231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.8272052Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8272631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8273405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8274144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.8274914Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8275716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.8276572Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8277362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.8278049Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8278685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.8279239Z fn() 2025-05-07T20:33:07.8279779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.8280396Z self.fn.run( 2025-05-07T20:33:07.8280887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8281456Z kernel = self.compile( 2025-05-07T20:33:07.8282032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8282727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8283393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8283640Z 2025-05-07T20:33:07.8283944Z self = 2025-05-07T20:33:07.8284804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8285417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1ba9ff9d0>} 2025-05-07T20:33:07.8286296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8286501Z context = 2025-05-07T20:33:07.8286505Z 2025-05-07T20:33:07.8286676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8286956Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8287073Z module_map=module_map) 2025-05-07T20:33:07.8287235Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8287334Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8287415Z E ^ 2025-05-07T20:33:07.8287802Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8287811Z 2025-05-07T20:33:07.8288264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8288270Z 2025-05-07T20:33:07.8288374Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8288606Z self=, 2025-05-07T20:33:07.8288691Z T=4096, 2025-05-07T20:33:07.8288765Z D=5120, 2025-05-07T20:33:07.8288847Z scale_ub=None, 2025-05-07T20:33:07.8288942Z contiguous=False, 2025-05-07T20:33:07.8289028Z compiled=False, 2025-05-07T20:33:07.8289106Z ) 2025-05-07T20:33:07.8289338Z self = 2025-05-07T20:33:07.8289518Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.8289522Z 2025-05-07T20:33:07.8289600Z @given( 2025-05-07T20:33:07.8289720Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8289821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8289941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8290119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8290234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8290314Z ) 2025-05-07T20:33:07.8290571Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8290673Z def test_silu_mul_quant( 2025-05-07T20:33:07.8290755Z self, 2025-05-07T20:33:07.8290833Z T: int, 2025-05-07T20:33:07.8290918Z D: int, 2025-05-07T20:33:07.8291014Z scale_ub: Optional[float], 2025-05-07T20:33:07.8291101Z contiguous: bool, 2025-05-07T20:33:07.8291193Z compiled: bool, 2025-05-07T20:33:07.8291270Z ) -> None: 2025-05-07T20:33:07.8291365Z torch.manual_seed(2025) 2025-05-07T20:33:07.8291447Z 2025-05-07T20:33:07.8291619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8291690Z 2025-05-07T20:33:07.8291787Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8291913Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8292010Z x = x_sign * x_clamp 2025-05-07T20:33:07.8292092Z x0 = x[:, :D] 2025-05-07T20:33:07.8292172Z x1 = x[:, D:] 2025-05-07T20:33:07.8292251Z 2025-05-07T20:33:07.8292334Z if contiguous: 2025-05-07T20:33:07.8292423Z x0 = x0.contiguous() 2025-05-07T20:33:07.8292567Z x1 = x1.contiguous() 2025-05-07T20:33:07.8292678Z 2025-05-07T20:33:07.8292772Z if scale_ub is not None: 2025-05-07T20:33:07.8292885Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8293020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8293096Z ) 2025-05-07T20:33:07.8293181Z else: 2025-05-07T20:33:07.8293275Z scale_ub_tensor = None 2025-05-07T20:33:07.8293388Z 2025-05-07T20:33:07.8293527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8293619Z op = silu_mul_quant 2025-05-07T20:33:07.8293715Z if compiled: 
2025-05-07T20:33:07.8293815Z op = torch.compile(op) 2025-05-07T20:33:07.8293920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8294001Z 2025-05-07T20:33:07.8294089Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8294094Z 2025-05-07T20:33:07.8294191Z moe/activation_test.py:117: 2025-05-07T20:33:07.8294336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8294442Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8294542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8295097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8295194Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8295589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8295825Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8296186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8296284Z kernel = self.compile( 2025-05-07T20:33:07.8296692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8296875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8297009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8297014Z 2025-05-07T20:33:07.8297224Z self = 2025-05-07T20:33:07.8298077Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8298886Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba86fe50>} 2025-05-07T20:33:07.8299838Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8300041Z context = 2025-05-07T20:33:07.8300046Z 2025-05-07T20:33:07.8300217Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8300501Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8300614Z module_map=module_map) 2025-05-07T20:33:07.8300786Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8300885Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8300969Z E ^ 2025-05-07T20:33:07.8301363Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8301368Z 2025-05-07T20:33:07.8301817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8301868Z 2025-05-07T20:33:07.8301981Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8302248Z self=, 2025-05-07T20:33:07.8302326Z T=4096, 2025-05-07T20:33:07.8302405Z D=7168, 2025-05-07T20:33:07.8302489Z scale_ub=None, 2025-05-07T20:33:07.8302576Z contiguous=False, 2025-05-07T20:33:07.8302683Z compiled=False, 2025-05-07T20:33:07.8302810Z ) 2025-05-07T20:33:07.8303052Z self = 2025-05-07T20:33:07.8303239Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.8303246Z 2025-05-07T20:33:07.8303323Z @given( 2025-05-07T20:33:07.8303447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8303544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8303658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8303783Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8303896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8303973Z ) 2025-05-07T20:33:07.8304237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8304328Z def test_silu_mul_quant( 2025-05-07T20:33:07.8304401Z self, 2025-05-07T20:33:07.8304482Z T: int, 2025-05-07T20:33:07.8304559Z D: int, 2025-05-07T20:33:07.8304659Z scale_ub: Optional[float], 2025-05-07T20:33:07.8304758Z contiguous: bool, 2025-05-07T20:33:07.8304841Z compiled: bool, 2025-05-07T20:33:07.8304925Z ) -> None: 2025-05-07T20:33:07.8305020Z torch.manual_seed(2025) 2025-05-07T20:33:07.8305094Z 2025-05-07T20:33:07.8305270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8305344Z 2025-05-07T20:33:07.8305437Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8305568Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8305656Z x = x_sign * x_clamp 2025-05-07T20:33:07.8305736Z x0 = x[:, :D] 2025-05-07T20:33:07.8305819Z x1 = x[:, D:] 2025-05-07T20:33:07.8305893Z 2025-05-07T20:33:07.8305977Z if contiguous: 2025-05-07T20:33:07.8306074Z x0 = x0.contiguous() 2025-05-07T20:33:07.8306162Z x1 = x1.contiguous() 2025-05-07T20:33:07.8306243Z 2025-05-07T20:33:07.8306341Z if scale_ub is not None: 2025-05-07T20:33:07.8306446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8306587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8306707Z ) 2025-05-07T20:33:07.8306784Z else: 2025-05-07T20:33:07.8306886Z scale_ub_tensor = None 2025-05-07T20:33:07.8314983Z 2025-05-07T20:33:07.8315151Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8315254Z op = silu_mul_quant 2025-05-07T20:33:07.8315351Z if compiled: 2025-05-07T20:33:07.8315453Z op = torch.compile(op) 2025-05-07T20:33:07.8315571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8315644Z 2025-05-07T20:33:07.8315738Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8315755Z 2025-05-07T20:33:07.8315856Z moe/activation_test.py:117: 2025-05-07T20:33:07.8315993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8316110Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8316213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8316771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8316885Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8317276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8317591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8317961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8318097Z kernel = self.compile( 2025-05-07T20:33:07.8318518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8318705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8318877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8318882Z 2025-05-07T20:33:07.8319109Z self = 2025-05-07T20:33:07.8319971Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8320536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1ba5a8a60>} 2025-05-07T20:33:07.8321359Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8321569Z context = 2025-05-07T20:33:07.8321576Z 2025-05-07T20:33:07.8321750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8322036Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8322158Z module_map=module_map) 2025-05-07T20:33:07.8322327Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8322428Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8322514Z E ^ 2025-05-07T20:33:07.8322900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:07.8323473Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7fd1ba5e2550>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:07.8340976Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
Same source listing as above; this time the error is raised from the op under test rather than the reference:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
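All of these failures share one root cause: both kernels quantize to torch.float8_e4m3fn, which Triton lowers to its fp8e4nv type, and NVIDIA GPUs below compute capability 8.9 (Ada/Hopper) cannot compile that type; on an Ampere-class runner GPU (such as an A10G, SM 8.6) only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal sketch of a capability guard that would skip these tests instead of failing them (hypothetical; supports_fp8e4nv and skip_unless_fp8e4nv are illustrative names, not FBGEMM's actual gating):

# Hypothetical capability gate (illustrative names, not FBGEMM's actual gating):
# skip fp8 tests on GPUs that cannot compile Triton's fp8e4nv / torch.float8_e4m3fn.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Ampere (SM 8.0/8.6) lacks native e4m3 support; Ada (8.9) and Hopper (9.0) have it.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


skip_unless_fp8e4nv = unittest.skipIf(
    not supports_fp8e4nv(),
    "fp8e4nv needs SM 8.9+; this GPU only supports fp8e4b15/fp8e5",
)

Decorating test_silu_mul_quant (or the whole test class) with skip_unless_fp8e4nv would turn this wall of identical CompilationErrors into a single skip on pre-Ada runners.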
Hypothesis keeps drawing examples; each one reprints the identical source listing and fails with the same fp8e4nv ValueError, differing only in the drawn parameters and in which kernel is compiled first:

2025-05-07T20:33:07.8354476Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> CompilationError from _fbgemm_silu_mul_quant via fn() (moe/activation_test.py:117)
2025-05-07T20:33:07.8367643Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
2025-05-07T20:33:07.8385481Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
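For orientation, the reference path that keeps failing, triton_quantize_fp8_row, performs per-row dynamic fp8 quantization; the test's own check, y = y_fp8.to(torch.float32) * y_scale[:, None], pins down the contract that y_scale is the per-row dequantization scale. A plain-PyTorch sketch of that contract (illustrative only: the real Triton kernel fuses this into one pass, and the exact scale_ub/epsilon handling here is an assumption):

# Illustrative pure-PyTorch model of row-wise fp8 quantization; the Triton kernel
# _kernel_quantize_fp8_row is the fused equivalent. Clamping/eps details are assumptions.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).to(torch.float32)  # per-row dynamic range
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)   # optional upper bound on the range
    row_max = torch.clamp(row_max, min=1e-12)        # avoid div-by-zero on all-zero rows
    y_scale = row_max / FP8_MAX                      # dequant scale: y ~ y_fp8 * y_scale[:, None]
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale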
The sweep continues in the same pattern:

2025-05-07T20:33:07.8402855Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
2025-05-07T20:33:07.8420144Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
2025-05-07T20:33:07.8437547Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError from _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
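The next drawn example fails on the direct, torch.compile'd path. A minimal standalone repro would look like the sketch below (assumptions: the import path is read off the traceback, and fresh random inputs stand in for the test's constructed ones; any SM < 8.9 GPU should reproduce the error):

# Hypothetical standalone repro of the failing example (T=1, D=5120,
# scale_ub=1200.0, compiled=True); import path taken from the traceback.
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

op = torch.compile(silu_mul_quant)
y_fp8, y_scale = op(x0, x1, scale_ub)  # raises CompilationError on SM < 8.9 GPUs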
torch.tensor( 2025-05-07T20:33:07.8479326Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8479404Z ) 2025-05-07T20:33:07.8479479Z else: 2025-05-07T20:33:07.8479572Z scale_ub_tensor = None 2025-05-07T20:33:07.8479646Z 2025-05-07T20:33:07.8479778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8479866Z op = silu_mul_quant 2025-05-07T20:33:07.8479953Z if compiled: 2025-05-07T20:33:07.8480099Z op = torch.compile(op) 2025-05-07T20:33:07.8480205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8480285Z 2025-05-07T20:33:07.8480373Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8480378Z 2025-05-07T20:33:07.8480480Z moe/activation_test.py:117: 2025-05-07T20:33:07.8480612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8480711Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8480818Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8481208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8481298Z return fn(*args, **kwargs) 2025-05-07T20:33:07.8481838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8481938Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8482328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8482560Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8483251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8483536Z kernel = self.compile( 2025-05-07T20:33:07.8483949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8484236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8484371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8484376Z 2025-05-07T20:33:07.8484590Z self = 2025-05-07T20:33:07.8485519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8486071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b90549d0>} 2025-05-07T20:33:07.8486895Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8487094Z context = 2025-05-07T20:33:07.8487098Z 2025-05-07T20:33:07.8487265Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8487551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8487661Z module_map=module_map) 2025-05-07T20:33:07.8487832Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8487929Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8488005Z E ^ 2025-05-07T20:33:07.8488392Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8488397Z 2025-05-07T20:33:07.8488842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8488849Z 2025-05-07T20:33:07.8488951Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8489191Z self=, 2025-05-07T20:33:07.8489265Z T=1, 2025-05-07T20:33:07.8489346Z D=5120, 2025-05-07T20:33:07.8489430Z scale_ub=None, 2025-05-07T20:33:07.8489513Z contiguous=False, 2025-05-07T20:33:07.8489599Z compiled=True, 2025-05-07T20:33:07.8489671Z ) 2025-05-07T20:33:07.8489961Z self = 2025-05-07T20:33:07.8490140Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8490145Z 2025-05-07T20:33:07.8490222Z @given( 2025-05-07T20:33:07.8490341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8490443Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8490561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8490685Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8490796Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8490868Z ) 2025-05-07T20:33:07.8491133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8491223Z def test_silu_mul_quant( 2025-05-07T20:33:07.8491299Z self, 2025-05-07T20:33:07.8491381Z T: int, 2025-05-07T20:33:07.8491456Z D: int, 2025-05-07T20:33:07.8491554Z scale_ub: Optional[float], 2025-05-07T20:33:07.8491653Z contiguous: bool, 2025-05-07T20:33:07.8491738Z compiled: bool, 2025-05-07T20:33:07.8491817Z ) -> None: 2025-05-07T20:33:07.8491918Z torch.manual_seed(2025) 2025-05-07T20:33:07.8491989Z 2025-05-07T20:33:07.8492167Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8492240Z 2025-05-07T20:33:07.8492376Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8492546Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8492634Z x = x_sign * x_clamp 2025-05-07T20:33:07.8492713Z x0 = x[:, :D] 2025-05-07T20:33:07.8492801Z x1 = x[:, D:] 2025-05-07T20:33:07.8492875Z 2025-05-07T20:33:07.8492956Z if contiguous: 2025-05-07T20:33:07.8493052Z x0 = x0.contiguous() 2025-05-07T20:33:07.8493183Z x1 = x1.contiguous() 2025-05-07T20:33:07.8493257Z 2025-05-07T20:33:07.8493354Z if scale_ub is not None: 2025-05-07T20:33:07.8493457Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8493602Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8493676Z ) 2025-05-07T20:33:07.8493751Z else: 2025-05-07T20:33:07.8493853Z scale_ub_tensor = None 2025-05-07T20:33:07.8493923Z 2025-05-07T20:33:07.8494054Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8494150Z op = silu_mul_quant 2025-05-07T20:33:07.8494241Z if compiled: 2025-05-07T20:33:07.8494340Z op = torch.compile(op) 2025-05-07T20:33:07.8494452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8494525Z 2025-05-07T20:33:07.8494615Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.8494743Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.8494819Z 2025-05-07T20:33:07.8494960Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8495062Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.8495163Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.8495289Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.8495429Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8495504Z 2025-05-07T20:33:07.8495606Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.8495615Z 2025-05-07T20:33:07.8495708Z moe/activation_test.py:126: 2025-05-07T20:33:07.8495840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8495949Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.8496089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.8496705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.8496807Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.8497236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8497474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8497864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.8498139Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8498568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.8498831Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.8499238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.8499408Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.8499775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.8499853Z fn() 2025-05-07T20:33:07.8500280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.8500365Z self.fn.run( 2025-05-07T20:33:07.8500762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8500918Z kernel = self.compile( 2025-05-07T20:33:07.8501330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8501505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8501633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8501683Z 2025-05-07T20:33:07.8501895Z self = 2025-05-07T20:33:07.8502745Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8503303Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd1b97ede50>} 2025-05-07T20:33:07.8504114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8504315Z context = 2025-05-07T20:33:07.8504320Z 2025-05-07T20:33:07.8504490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8504764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8504879Z module_map=module_map) 2025-05-07T20:33:07.8505040Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8505143Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.8505214Z E ^ 2025-05-07T20:33:07.8505596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8505604Z 2025-05-07T20:33:07.8506055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
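Every failure in this run bottoms out in the same lowering step: ast_to_ttir rejects the fp8e4nv element type. fp8e4nv is Triton's name for FP8 E4M3 on NVIDIA GPUs, and the NVIDIA backend only lowers that type on compute capability 8.9 or newer; a GPU that reports only fp8e4b15 and fp8e5, as the ValueError here does, is a pre-Ada part (the A10G, SM 8.6, behaves exactly this way). A minimal guard sketch that would make the suite skip rather than error on such GPUs; the helper supports_fp8e4nv is hypothetical and not part of fbgemm_gpu or this test file:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (FP8 E4M3) only compiles for
    # NVIDIA GPUs with compute capability >= 8.9; SM 8.6 parts expose only
    # fp8e4b15 and fp8e5, which is precisely the ValueError in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test above it would read:
# @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...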
Hypothesis continued sampling examples; each one reprinted the identical test body shown above and failed with the same triton.compiler.errors.CompilationError raised from /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100, wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the sampled parameters and the kernel that failed to compile varied: fn() dies in _fbgemm_silu_mul_quant[grid]( at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (reached through torch/_dynamo/eval_frame.py:678 when compiled=True), while ref_fn() dies in _kernel_quantize_fp8_row[grid]( at fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -> _kernel_quantize_fp8_row, from ref_fn() (moe/activation_test.py:126)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant, from fn() (moe/activation_test.py:117)
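The sweep above shows the failure is insensitive to T, D, scale_ub, contiguity, and torch.compile; every combination dies in the same make_ir/ast_to_ttir call, which points at the architecture rather than the inputs. A standalone repro sketch under that reading, reusing the reference-path entry point and module path from the traceback (the explicit None mirrors the test's scale_ub_tensor); treat it as a sketch, not a verified reproducer:

import torch

# Module path copied from the traceback (fp8_gemm.py:2370).
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

# Shapes mirror the T=1, D=5120 example; any float32 CUDA matrix reaches the
# Triton JIT. On a GPU without FP8 E4M3 support this call should raise the
# same CompilationError ("type fp8e4nv not supported in this architecture").
y = torch.randn(1, 5120, device="cuda", dtype=torch.float32)
y_fp8, y_scale = triton_quantize_fp8_row(y, None)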
Hypothesis went on to try eleven more examples, and every one failed identically: the Triton compiler raised CompilationError at the first line of _fbgemm_silu_mul_quant with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The compiled=True runs differ only by an extra torch/_dynamo/eval_frame.py:678 frame (return fn(*args, **kwargs)) in the traceback. The examples tried:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
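Note on the failure: this is an architecture limitation, not a kernel bug. fp8e4nv is Triton's name for the float8_e4m3fn type, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward; the A10G on a linux.g5.4xlarge runner is SM 8.6, so every compilation of _fbgemm_silu_mul_quant fails identically. Below is a minimal sketch of the kind of capability guard that would skip these cases instead of failing them, assuming unittest and a CUDA build of PyTorch (supports_fp8e4nv and the class name are illustrative, not taken from the test source):

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) kernels require compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Applied to the test class, unsupported runners would report a skip
    # rather than repeated identical CompilationErrors.
    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class ActivationTests(unittest.TestCase):
        ...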
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8817587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8817834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8818204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8818303Z kernel = self.compile( 2025-05-07T20:33:07.8818724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8818908Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8819042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8819048Z 2025-05-07T20:33:07.8819270Z self = 2025-05-07T20:33:07.8820120Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8820712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b87bb670>} 2025-05-07T20:33:07.8821561Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8821768Z context = 2025-05-07T20:33:07.8821773Z 2025-05-07T20:33:07.8821986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8822264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8822382Z module_map=module_map) 2025-05-07T20:33:07.8822545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8822646Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8822737Z E ^ 2025-05-07T20:33:07.8823127Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8823134Z 2025-05-07T20:33:07.8823593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8823598Z 2025-05-07T20:33:07.8823705Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8823942Z self=, 2025-05-07T20:33:07.8824034Z T=2048, 2025-05-07T20:33:07.8824115Z D=7168, 2025-05-07T20:33:07.8824201Z scale_ub=None, 2025-05-07T20:33:07.8824297Z contiguous=False, 2025-05-07T20:33:07.8824386Z compiled=True, 2025-05-07T20:33:07.8824470Z ) 2025-05-07T20:33:07.8824700Z self = 2025-05-07T20:33:07.8824883Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8824887Z 2025-05-07T20:33:07.8824975Z @given( 2025-05-07T20:33:07.8825101Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8825206Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8825336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8825460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8825577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8825665Z ) 2025-05-07T20:33:07.8825929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8826034Z def test_silu_mul_quant( 2025-05-07T20:33:07.8826115Z self, 2025-05-07T20:33:07.8826197Z T: int, 2025-05-07T20:33:07.8826283Z D: int, 2025-05-07T20:33:07.8826430Z scale_ub: Optional[float], 2025-05-07T20:33:07.8826521Z contiguous: bool, 2025-05-07T20:33:07.8826613Z compiled: bool, 2025-05-07T20:33:07.8826692Z ) -> None: 2025-05-07T20:33:07.8826787Z torch.manual_seed(2025) 2025-05-07T20:33:07.8826864Z 2025-05-07T20:33:07.8827039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8827119Z 2025-05-07T20:33:07.8827222Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8827349Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8827451Z x = x_sign * x_clamp 2025-05-07T20:33:07.8827535Z x0 = x[:, :D] 2025-05-07T20:33:07.8827621Z x1 = x[:, D:] 2025-05-07T20:33:07.8827706Z 2025-05-07T20:33:07.8827794Z if contiguous: 2025-05-07T20:33:07.8827888Z x0 = x0.contiguous() 2025-05-07T20:33:07.8827990Z x1 = x1.contiguous() 2025-05-07T20:33:07.8828066Z 2025-05-07T20:33:07.8828166Z if scale_ub is not None: 2025-05-07T20:33:07.8828285Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8828424Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8828506Z ) 2025-05-07T20:33:07.8828591Z else: 2025-05-07T20:33:07.8828688Z scale_ub_tensor = None 2025-05-07T20:33:07.8828806Z 2025-05-07T20:33:07.8828951Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8829079Z op = silu_mul_quant 2025-05-07T20:33:07.8829174Z if compiled: 2025-05-07T20:33:07.8829275Z op = torch.compile(op) 2025-05-07T20:33:07.8829383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8829465Z 2025-05-07T20:33:07.8829597Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8829601Z 2025-05-07T20:33:07.8829699Z moe/activation_test.py:117: 2025-05-07T20:33:07.8829959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8830066Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8830168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8830575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8830667Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8831213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8831316Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8831700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8831939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8832309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8832413Z kernel = self.compile( 2025-05-07T20:33:07.8832830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8833016Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8833158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8833165Z 2025-05-07T20:33:07.8833382Z self = 2025-05-07T20:33:07.8834240Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8834789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b85ef550>} 2025-05-07T20:33:07.8835682Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8835889Z context = 2025-05-07T20:33:07.8835893Z 2025-05-07T20:33:07.8836068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8836356Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8836470Z module_map=module_map) 2025-05-07T20:33:07.8836637Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8836752Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8836833Z E ^ 2025-05-07T20:33:07.8837219Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8837230Z 2025-05-07T20:33:07.8837684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8837689Z 2025-05-07T20:33:07.8837795Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8838035Z self=, 2025-05-07T20:33:07.8838117Z T=4096, 2025-05-07T20:33:07.8838237Z D=7168, 2025-05-07T20:33:07.8838332Z scale_ub=None, 2025-05-07T20:33:07.8838457Z contiguous=False, 2025-05-07T20:33:07.8838541Z compiled=True, 2025-05-07T20:33:07.8838624Z ) 2025-05-07T20:33:07.8838852Z self = 2025-05-07T20:33:07.8839043Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8839085Z 2025-05-07T20:33:07.8839165Z @given( 2025-05-07T20:33:07.8839286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8839392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8839511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8839630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8839752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8839827Z ) 2025-05-07T20:33:07.8840090Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8840192Z def test_silu_mul_quant( 2025-05-07T20:33:07.8840274Z self, 2025-05-07T20:33:07.8840356Z T: int, 2025-05-07T20:33:07.8840432Z D: int, 2025-05-07T20:33:07.8840531Z scale_ub: Optional[float], 2025-05-07T20:33:07.8840629Z contiguous: bool, 2025-05-07T20:33:07.8840713Z compiled: bool, 2025-05-07T20:33:07.8840791Z ) -> None: 2025-05-07T20:33:07.8840895Z torch.manual_seed(2025) 2025-05-07T20:33:07.8840974Z 2025-05-07T20:33:07.8841148Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8841228Z 2025-05-07T20:33:07.8841322Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8841449Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8841546Z x = x_sign * x_clamp 2025-05-07T20:33:07.8841628Z x0 = x[:, :D] 2025-05-07T20:33:07.8841716Z x1 = x[:, D:] 2025-05-07T20:33:07.8841791Z 2025-05-07T20:33:07.8841877Z if contiguous: 2025-05-07T20:33:07.8841975Z x0 = x0.contiguous() 2025-05-07T20:33:07.8842068Z x1 = x1.contiguous() 2025-05-07T20:33:07.8842143Z 2025-05-07T20:33:07.8842244Z if scale_ub is not None: 2025-05-07T20:33:07.8842352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8842491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8842576Z ) 2025-05-07T20:33:07.8842674Z else: 2025-05-07T20:33:07.8842779Z scale_ub_tensor = None 2025-05-07T20:33:07.8842882Z 2025-05-07T20:33:07.8843015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8843161Z op = silu_mul_quant 2025-05-07T20:33:07.8843264Z if compiled: 2025-05-07T20:33:07.8843367Z op = torch.compile(op) 2025-05-07T20:33:07.8843479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8843555Z 2025-05-07T20:33:07.8843646Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8843652Z 2025-05-07T20:33:07.8843757Z moe/activation_test.py:117: 2025-05-07T20:33:07.8843899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8843999Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8844110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8844507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8844608Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8845154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8845254Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8845644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8845877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8846289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8846425Z kernel = self.compile( 2025-05-07T20:33:07.8846834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8847023Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8847154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8847198Z 2025-05-07T20:33:07.8847414Z self = 2025-05-07T20:33:07.8848271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8848823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8367160>} 2025-05-07T20:33:07.8849650Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8849849Z context = 2025-05-07T20:33:07.8849856Z 2025-05-07T20:33:07.8850032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8850313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8850427Z module_map=module_map) 2025-05-07T20:33:07.8850600Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8850702Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8850784Z E ^ 2025-05-07T20:33:07.8851186Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8851193Z 2025-05-07T20:33:07.8851643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8851648Z 2025-05-07T20:33:07.8851760Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8851996Z self=, 2025-05-07T20:33:07.8852080Z T=16384, 2025-05-07T20:33:07.8852165Z D=5120, 2025-05-07T20:33:07.8852251Z scale_ub=1200.0, 2025-05-07T20:33:07.8852343Z contiguous=False, 2025-05-07T20:33:07.8852477Z compiled=False, 2025-05-07T20:33:07.8852556Z ) 2025-05-07T20:33:07.8852785Z self = 2025-05-07T20:33:07.8852984Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.8852988Z 2025-05-07T20:33:07.8853073Z @given( 2025-05-07T20:33:07.8853199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8853301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8853417Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8853541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8853655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8853730Z ) 2025-05-07T20:33:07.8854002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8854096Z def test_silu_mul_quant( 2025-05-07T20:33:07.8854175Z self, 2025-05-07T20:33:07.8854263Z T: int, 2025-05-07T20:33:07.8854342Z D: int, 2025-05-07T20:33:07.8854450Z scale_ub: Optional[float], 2025-05-07T20:33:07.8854539Z contiguous: bool, 2025-05-07T20:33:07.8854627Z compiled: bool, 2025-05-07T20:33:07.8854713Z ) -> None: 2025-05-07T20:33:07.8854806Z torch.manual_seed(2025) 2025-05-07T20:33:07.8854922Z 2025-05-07T20:33:07.8855104Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8855217Z 2025-05-07T20:33:07.8855308Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8855439Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8855528Z x = x_sign * x_clamp 2025-05-07T20:33:07.8855616Z x0 = x[:, :D] 2025-05-07T20:33:07.8855744Z x1 = x[:, D:] 2025-05-07T20:33:07.8855818Z 2025-05-07T20:33:07.8855909Z if contiguous: 2025-05-07T20:33:07.8856000Z x0 = x0.contiguous() 2025-05-07T20:33:07.8856090Z x1 = x1.contiguous() 2025-05-07T20:33:07.8856170Z 2025-05-07T20:33:07.8856261Z if scale_ub is not None: 2025-05-07T20:33:07.8856368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8856509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8856586Z ) 2025-05-07T20:33:07.8856663Z else: 2025-05-07T20:33:07.8856770Z scale_ub_tensor = None 2025-05-07T20:33:07.8856858Z 2025-05-07T20:33:07.8861585Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8861702Z op = silu_mul_quant 2025-05-07T20:33:07.8861796Z if compiled: 2025-05-07T20:33:07.8861901Z op = torch.compile(op) 2025-05-07T20:33:07.8862018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8862100Z 2025-05-07T20:33:07.8862202Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8862207Z 2025-05-07T20:33:07.8862306Z moe/activation_test.py:117: 2025-05-07T20:33:07.8862448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8862563Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8862665Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8863225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:07.8863334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8863729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8863975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8864341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8864440Z kernel = self.compile( 2025-05-07T20:33:07.8864866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8865127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8865265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8865277Z 2025-05-07T20:33:07.8865491Z self = 2025-05-07T20:33:07.8866350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8866917Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8367940>} 2025-05-07T20:33:07.8867742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8867948Z context = 2025-05-07T20:33:07.8867952Z 2025-05-07T20:33:07.8868124Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8868452Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8868579Z module_map=module_map) 2025-05-07T20:33:07.8868841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8868952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8869032Z E ^ 2025-05-07T20:33:07.8869416Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8869492Z 2025-05-07T20:33:07.8870096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8870101Z 2025-05-07T20:33:07.8870212Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8870450Z self=, 2025-05-07T20:33:07.8870541Z T=16384, 2025-05-07T20:33:07.8870619Z D=5120, 2025-05-07T20:33:07.8870716Z scale_ub=1200.0, 2025-05-07T20:33:07.8870803Z contiguous=True, 2025-05-07T20:33:07.8870892Z compiled=True, 2025-05-07T20:33:07.8870981Z ) 2025-05-07T20:33:07.8871214Z self = 2025-05-07T20:33:07.8871396Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.8871400Z 2025-05-07T20:33:07.8871489Z @given( 2025-05-07T20:33:07.8871610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8871716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8871850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8871973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8872101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8872182Z ) 2025-05-07T20:33:07.8872449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8872555Z def test_silu_mul_quant( 2025-05-07T20:33:07.8872640Z self, 2025-05-07T20:33:07.8872727Z T: int, 2025-05-07T20:33:07.8872819Z D: int, 2025-05-07T20:33:07.8872927Z scale_ub: Optional[float], 2025-05-07T20:33:07.8873019Z contiguous: bool, 2025-05-07T20:33:07.8873117Z compiled: bool, 2025-05-07T20:33:07.8873203Z ) -> None: 2025-05-07T20:33:07.8873305Z torch.manual_seed(2025) 2025-05-07T20:33:07.8873395Z 2025-05-07T20:33:07.8873575Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8873664Z 2025-05-07T20:33:07.8873765Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8873895Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8874044Z x = x_sign * x_clamp 2025-05-07T20:33:07.8874132Z x0 = x[:, :D] 2025-05-07T20:33:07.8874213Z x1 = x[:, D:] 2025-05-07T20:33:07.8874297Z 2025-05-07T20:33:07.8874382Z if contiguous: 2025-05-07T20:33:07.8874477Z x0 = x0.contiguous() 2025-05-07T20:33:07.8874574Z x1 = x1.contiguous() 2025-05-07T20:33:07.8874650Z 2025-05-07T20:33:07.8874745Z if scale_ub is not None: 2025-05-07T20:33:07.8874863Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8875007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8875095Z ) 2025-05-07T20:33:07.8875172Z else: 2025-05-07T20:33:07.8875283Z scale_ub_tensor = None 2025-05-07T20:33:07.8875361Z 2025-05-07T20:33:07.8875501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8875601Z op = silu_mul_quant 2025-05-07T20:33:07.8875689Z if compiled: 2025-05-07T20:33:07.8875797Z op = torch.compile(op) 2025-05-07T20:33:07.8875920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8875998Z 2025-05-07T20:33:07.8876092Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8876097Z 2025-05-07T20:33:07.8876202Z moe/activation_test.py:117: 2025-05-07T20:33:07.8876387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8876539Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8876647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8877042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8877143Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8877682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8877826Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8878218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8878453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8878824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8878922Z kernel = self.compile( 2025-05-07T20:33:07.8879332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8879528Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8879663Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8879668Z 2025-05-07T20:33:07.8879889Z self = 2025-05-07T20:33:07.8880748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8881302Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b81d7550>} 2025-05-07T20:33:07.8882129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8882332Z context = 2025-05-07T20:33:07.8882337Z 2025-05-07T20:33:07.8882517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8883114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8883265Z module_map=module_map) 2025-05-07T20:33:07.8883634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8883739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8883819Z E ^ 2025-05-07T20:33:07.8884211Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8884216Z 2025-05-07T20:33:07.8884673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8884680Z 2025-05-07T20:33:07.8884791Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8885023Z self=, 2025-05-07T20:33:07.8885103Z T=16384, 2025-05-07T20:33:07.8885189Z D=5120, 2025-05-07T20:33:07.8885278Z scale_ub=None, 2025-05-07T20:33:07.8885370Z contiguous=False, 2025-05-07T20:33:07.8885467Z compiled=True, 2025-05-07T20:33:07.8885545Z ) 2025-05-07T20:33:07.8885787Z self = 2025-05-07T20:33:07.8885977Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8885981Z 2025-05-07T20:33:07.8886063Z @given( 2025-05-07T20:33:07.8886192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8886363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8886486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8886671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8886786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8886860Z ) 2025-05-07T20:33:07.8887127Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8887225Z def test_silu_mul_quant( 2025-05-07T20:33:07.8887366Z self, 2025-05-07T20:33:07.8887443Z T: int, 2025-05-07T20:33:07.8887522Z D: int, 2025-05-07T20:33:07.8887627Z scale_ub: Optional[float], 2025-05-07T20:33:07.8887718Z contiguous: bool, 2025-05-07T20:33:07.8887803Z compiled: bool, 2025-05-07T20:33:07.8887892Z ) -> None: 2025-05-07T20:33:07.8887991Z torch.manual_seed(2025) 2025-05-07T20:33:07.8888069Z 2025-05-07T20:33:07.8888252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8888333Z 2025-05-07T20:33:07.8888429Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8888566Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8888660Z x = x_sign * x_clamp 2025-05-07T20:33:07.8888752Z x0 = x[:, :D] 2025-05-07T20:33:07.8888835Z x1 = x[:, D:] 2025-05-07T20:33:07.8888912Z 2025-05-07T20:33:07.8889011Z if contiguous: 2025-05-07T20:33:07.8889108Z x0 = x0.contiguous() 2025-05-07T20:33:07.8889205Z x1 = x1.contiguous() 2025-05-07T20:33:07.8889289Z 2025-05-07T20:33:07.8889384Z if scale_ub is not None: 2025-05-07T20:33:07.8889497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8889644Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8889725Z ) 2025-05-07T20:33:07.8889809Z else: 2025-05-07T20:33:07.8889912Z scale_ub_tensor = None 2025-05-07T20:33:07.8889990Z 2025-05-07T20:33:07.8890129Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8890230Z op = silu_mul_quant 2025-05-07T20:33:07.8890323Z if compiled: 2025-05-07T20:33:07.8890436Z op = torch.compile(op) 2025-05-07T20:33:07.8890547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8890628Z 2025-05-07T20:33:07.8890733Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8890737Z 2025-05-07T20:33:07.8890840Z moe/activation_test.py:117: 2025-05-07T20:33:07.8890977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8891090Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8891241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8891648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8891743Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8892288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8892397Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8892783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8893017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8893393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8893492Z kernel = self.compile( 2025-05-07T20:33:07.8893917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8894102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8894237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8894242Z 2025-05-07T20:33:07.8894508Z self = 2025-05-07T20:33:07.8895359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8895953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b83471f0>} 2025-05-07T20:33:07.8896807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8897004Z context = 2025-05-07T20:33:07.8897014Z 2025-05-07T20:33:07.8897185Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8897466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8897587Z module_map=module_map) 2025-05-07T20:33:07.8897754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8897855Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8897945Z E ^ 2025-05-07T20:33:07.8898334Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8898342Z 2025-05-07T20:33:07.8898800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8898805Z 2025-05-07T20:33:07.8898910Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8899149Z self=, 2025-05-07T20:33:07.8899240Z T=2048, 2025-05-07T20:33:07.8899324Z D=5120, 2025-05-07T20:33:07.8899412Z scale_ub=None, 2025-05-07T20:33:07.8899511Z contiguous=False, 2025-05-07T20:33:07.8899598Z compiled=True, 2025-05-07T20:33:07.8899679Z ) 2025-05-07T20:33:07.8899918Z self = 2025-05-07T20:33:07.8900104Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.8900109Z 2025-05-07T20:33:07.8900196Z @given( 2025-05-07T20:33:07.8900322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8900426Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8900551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8900717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8900835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8900916Z ) 2025-05-07T20:33:07.8901176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8901271Z def test_silu_mul_quant( 2025-05-07T20:33:07.8901355Z self, 2025-05-07T20:33:07.8901433Z T: int, 2025-05-07T20:33:07.8901519Z D: int, 2025-05-07T20:33:07.8901617Z scale_ub: Optional[float], 2025-05-07T20:33:07.8901706Z contiguous: bool, 2025-05-07T20:33:07.8901796Z compiled: bool, 2025-05-07T20:33:07.8901876Z ) -> None: 2025-05-07T20:33:07.8901976Z torch.manual_seed(2025) 2025-05-07T20:33:07.8902060Z 2025-05-07T20:33:07.8902240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8902318Z 2025-05-07T20:33:07.8902419Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8902550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8902644Z x = x_sign * x_clamp 2025-05-07T20:33:07.8902739Z x0 = x[:, :D] 2025-05-07T20:33:07.8902825Z x1 = x[:, D:] 2025-05-07T20:33:07.8902907Z 2025-05-07T20:33:07.8902995Z if contiguous: 2025-05-07T20:33:07.8903089Z x0 = x0.contiguous() 2025-05-07T20:33:07.8903232Z x1 = x1.contiguous() 2025-05-07T20:33:07.8903371Z 2025-05-07T20:33:07.8903464Z if scale_ub is not None: 2025-05-07T20:33:07.8903578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8903715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8903793Z ) 2025-05-07T20:33:07.8903876Z else: 2025-05-07T20:33:07.8903972Z scale_ub_tensor = None 2025-05-07T20:33:07.8904151Z 2025-05-07T20:33:07.8904290Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8904383Z op = silu_mul_quant 2025-05-07T20:33:07.8904473Z if compiled: 2025-05-07T20:33:07.8904581Z op = torch.compile(op) 2025-05-07T20:33:07.8904687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8904769Z 2025-05-07T20:33:07.8904860Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8904864Z 2025-05-07T20:33:07.8904963Z moe/activation_test.py:117: 2025-05-07T20:33:07.8905107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8905213Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8905315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8905716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8905811Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8906359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8906459Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8906843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8907083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8907455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8907552Z kernel = self.compile( 2025-05-07T20:33:07.8907980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8908164Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8908305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8908313Z 2025-05-07T20:33:07.8908531Z self = 2025-05-07T20:33:07.8909433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8910148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8347f70>} 2025-05-07T20:33:07.8910964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8911172Z context = 2025-05-07T20:33:07.8911177Z 2025-05-07T20:33:07.8911349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8911640Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8911753Z module_map=module_map) 2025-05-07T20:33:07.8911919Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8912026Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8912104Z E ^ 2025-05-07T20:33:07.8912530Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8912535Z 2025-05-07T20:33:07.8913026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8913030Z 2025-05-07T20:33:07.8913134Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8913372Z self=, 2025-05-07T20:33:07.8913451Z T=2048, 2025-05-07T20:33:07.8913571Z D=5120, 2025-05-07T20:33:07.8913661Z scale_ub=1200.0, 2025-05-07T20:33:07.8913747Z contiguous=False, 2025-05-07T20:33:07.8913832Z compiled=True, 2025-05-07T20:33:07.8913909Z ) 2025-05-07T20:33:07.8914139Z self = 2025-05-07T20:33:07.8914321Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.8914335Z 2025-05-07T20:33:07.8914419Z @given( 2025-05-07T20:33:07.8914546Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8914654Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8914777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8914898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8915021Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8915104Z ) 2025-05-07T20:33:07.8915369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8915479Z def test_silu_mul_quant( 2025-05-07T20:33:07.8915559Z self, 2025-05-07T20:33:07.8915639Z T: int, 2025-05-07T20:33:07.8915728Z D: int, 2025-05-07T20:33:07.8915833Z scale_ub: Optional[float], 2025-05-07T20:33:07.8915930Z contiguous: bool, 2025-05-07T20:33:07.8916019Z compiled: bool, 2025-05-07T20:33:07.8916099Z ) -> None: 2025-05-07T20:33:07.8916201Z torch.manual_seed(2025) 2025-05-07T20:33:07.8916279Z 2025-05-07T20:33:07.8916458Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8916541Z 2025-05-07T20:33:07.8916642Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8916771Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8916868Z x = x_sign * x_clamp 2025-05-07T20:33:07.8916953Z x0 = x[:, :D] 2025-05-07T20:33:07.8917037Z x1 = x[:, D:] 2025-05-07T20:33:07.8917120Z 2025-05-07T20:33:07.8917217Z if contiguous: 2025-05-07T20:33:07.8917315Z x0 = x0.contiguous() 2025-05-07T20:33:07.8917415Z x1 = x1.contiguous() 2025-05-07T20:33:07.8917492Z 2025-05-07T20:33:07.8917639Z if scale_ub is not None: 2025-05-07T20:33:07.8917751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8917889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8917973Z ) 2025-05-07T20:33:07.8918050Z else: 2025-05-07T20:33:07.8918146Z scale_ub_tensor = None 2025-05-07T20:33:07.8918230Z 2025-05-07T20:33:07.8918361Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8918453Z op = silu_mul_quant 2025-05-07T20:33:07.8918545Z if compiled: 2025-05-07T20:33:07.8918646Z op = torch.compile(op) 2025-05-07T20:33:07.8918752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8918836Z 2025-05-07T20:33:07.8918932Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8918939Z 2025-05-07T20:33:07.8919045Z moe/activation_test.py:117: 2025-05-07T20:33:07.8919179Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8919286Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8919398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8919797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8919895Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8920485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8920619Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8921011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8921245Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8921647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8921748Z kernel = self.compile( 2025-05-07T20:33:07.8922162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8922348Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8922480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8922485Z 2025-05-07T20:33:07.8922701Z self = 2025-05-07T20:33:07.8923559Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8924108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b8195940>} 2025-05-07T20:33:07.8924934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8925131Z context = 2025-05-07T20:33:07.8925136Z 2025-05-07T20:33:07.8925311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8925601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8925714Z module_map=module_map) 2025-05-07T20:33:07.8925886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8925988Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8926070Z E ^ 2025-05-07T20:33:07.8926463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8926471Z 2025-05-07T20:33:07.8926966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8926971Z 2025-05-07T20:33:07.8927087Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8927320Z self=, 2025-05-07T20:33:07.8927398Z T=4096, 2025-05-07T20:33:07.8927484Z D=5120, 2025-05-07T20:33:07.8927572Z scale_ub=1200.0, 2025-05-07T20:33:07.8927660Z contiguous=True, 2025-05-07T20:33:07.8927755Z compiled=True, 2025-05-07T20:33:07.8927831Z ) 2025-05-07T20:33:07.8928059Z self = 2025-05-07T20:33:07.8928243Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.8928248Z 2025-05-07T20:33:07.8928344Z @given( 2025-05-07T20:33:07.8928465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8928574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8928695Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8928814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8928934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8929011Z ) 2025-05-07T20:33:07.8929268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8929418Z def test_silu_mul_quant( 2025-05-07T20:33:07.8929502Z self, 2025-05-07T20:33:07.8929623Z T: int, 2025-05-07T20:33:07.8929712Z D: int, 2025-05-07T20:33:07.8929816Z scale_ub: Optional[float], 2025-05-07T20:33:07.8929913Z contiguous: bool, 2025-05-07T20:33:07.8930004Z compiled: bool, 2025-05-07T20:33:07.8930087Z ) -> None: 2025-05-07T20:33:07.8930192Z torch.manual_seed(2025) 2025-05-07T20:33:07.8930306Z 2025-05-07T20:33:07.8930480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8930562Z 2025-05-07T20:33:07.8930653Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8930784Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8930881Z x = x_sign * x_clamp 2025-05-07T20:33:07.8930960Z x0 = x[:, :D] 2025-05-07T20:33:07.8931043Z x1 = x[:, D:] 2025-05-07T20:33:07.8931125Z 2025-05-07T20:33:07.8931208Z if contiguous: 2025-05-07T20:33:07.8931304Z x0 = x0.contiguous() 2025-05-07T20:33:07.8931403Z x1 = x1.contiguous() 2025-05-07T20:33:07.8931483Z 2025-05-07T20:33:07.8931581Z if scale_ub is not None: 2025-05-07T20:33:07.8931688Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8931824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8931905Z ) 2025-05-07T20:33:07.8931983Z else: 2025-05-07T20:33:07.8932082Z scale_ub_tensor = None 2025-05-07T20:33:07.8932166Z 2025-05-07T20:33:07.8932299Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8932396Z op = silu_mul_quant 2025-05-07T20:33:07.8932487Z if compiled: 2025-05-07T20:33:07.8932589Z op = torch.compile(op) 2025-05-07T20:33:07.8932695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8932774Z 2025-05-07T20:33:07.8932866Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8932870Z 2025-05-07T20:33:07.8932981Z moe/activation_test.py:117: 2025-05-07T20:33:07.8933120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8933225Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8933336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8933734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8933832Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8934381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8934528Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8934924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8935158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8935523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8935626Z kernel = self.compile( 2025-05-07T20:33:07.8936037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8936225Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8936357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8936364Z 2025-05-07T20:33:07.8936579Z self = 2025-05-07T20:33:07.8937442Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8938066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b80f7790>} 2025-05-07T20:33:07.8938924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8939124Z context = 2025-05-07T20:33:07.8939165Z 2025-05-07T20:33:07.8939340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8939625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8939737Z module_map=module_map) 2025-05-07T20:33:07.8939908Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8940007Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8940084Z E ^ 2025-05-07T20:33:07.8940476Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.8940483Z 2025-05-07T20:33:07.8940931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.8940936Z 2025-05-07T20:33:07.8941049Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.8941282Z self=, 2025-05-07T20:33:07.8941364Z T=128, 2025-05-07T20:33:07.8941451Z D=5120, 2025-05-07T20:33:07.8941534Z scale_ub=1200.0, 2025-05-07T20:33:07.8941622Z contiguous=False, 2025-05-07T20:33:07.8941714Z compiled=True, 2025-05-07T20:33:07.8941788Z ) 2025-05-07T20:33:07.8942016Z self = 2025-05-07T20:33:07.8942200Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.8942205Z 2025-05-07T20:33:07.8942284Z @given( 2025-05-07T20:33:07.8942405Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.8942515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.8942633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.8942756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.8942870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.8942945Z ) 2025-05-07T20:33:07.8943208Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.8943310Z def test_silu_mul_quant( 2025-05-07T20:33:07.8943391Z self, 2025-05-07T20:33:07.8943479Z T: int, 2025-05-07T20:33:07.8943607Z D: int, 2025-05-07T20:33:07.8943711Z scale_ub: Optional[float], 2025-05-07T20:33:07.8943810Z contiguous: bool, 2025-05-07T20:33:07.8943899Z compiled: bool, 2025-05-07T20:33:07.8943987Z ) -> None: 2025-05-07T20:33:07.8944088Z torch.manual_seed(2025) 2025-05-07T20:33:07.8944165Z 2025-05-07T20:33:07.8944350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.8944432Z 2025-05-07T20:33:07.8944530Z x_sign = torch.sign(x) 2025-05-07T20:33:07.8944666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.8944760Z x = x_sign * x_clamp 2025-05-07T20:33:07.8944844Z x0 = x[:, :D] 2025-05-07T20:33:07.8944934Z x1 = x[:, D:] 2025-05-07T20:33:07.8945013Z 2025-05-07T20:33:07.8945103Z if contiguous: 2025-05-07T20:33:07.8945204Z x0 = x0.contiguous() 2025-05-07T20:33:07.8945298Z x1 = x1.contiguous() 2025-05-07T20:33:07.8945378Z 2025-05-07T20:33:07.8945481Z if scale_ub is not None: 2025-05-07T20:33:07.8945590Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.8945736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.8945814Z ) 2025-05-07T20:33:07.8945892Z else: 2025-05-07T20:33:07.8946039Z scale_ub_tensor = None 2025-05-07T20:33:07.8946117Z 2025-05-07T20:33:07.8946286Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.8946384Z op = silu_mul_quant 2025-05-07T20:33:07.8946471Z if compiled: 2025-05-07T20:33:07.8946570Z op = torch.compile(op) 2025-05-07T20:33:07.8946684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8946754Z 2025-05-07T20:33:07.8946888Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.8946900Z 2025-05-07T20:33:07.8947000Z moe/activation_test.py:117: 2025-05-07T20:33:07.8947136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8947243Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.8947343Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.8947737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.8947842Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.8948380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.8948481Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.8948870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.8949103Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.8949481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.8949576Z kernel = self.compile( 2025-05-07T20:33:07.8950130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.8950318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.8950454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.8950458Z 2025-05-07T20:33:07.8950682Z self = 2025-05-07T20:33:07.8951532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.8952084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b80690d0>} 2025-05-07T20:33:07.8952951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.8953150Z context = 2025-05-07T20:33:07.8953155Z 2025-05-07T20:33:07.8953334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.8953615Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.8953720Z module_map=module_map) 2025-05-07T20:33:07.8953894Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.8953992Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.8954081Z E ^ 2025-05-07T20:33:07.8954463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

[The identical CompilationError (ValueError: type fp8e4nv not supported in this architecture; the supported fp8 dtypes are 'fp8e4b15' and 'fp8e5'), with the same traceback through _fbgemm_silu_mul_quant, was also raised for the following examples:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)]

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = 
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    [test body as above]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
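[Note on the CompilationError: this job runs on a linux.g5.4xlarge runner, whose NVIDIA A10G GPU reports compute capability 8.6 (sm_86). Triton's fp8e4nv type, the e4m3 format that _fbgemm_silu_mul_quant quantizes to (torch.float8_e4m3fn on the PyTorch side), is only lowered on compute capability 8.9 and newer (Ada/Hopper), which is why the compiler offers only fp8e4b15 and fp8e5 here. A minimal sketch of a capability guard that would skip these examples on pre-sm_89 runners; supports_fp8e4nv() and the decorator placement are assumptions for illustration, not part of activation_test.py:

# Sketch only: gate fp8e4nv-dependent tests on GPU compute capability.
# supports_fp8e4nv() is a hypothetical helper, not an FBGEMM API.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if the current CUDA device can compile Triton fp8e4nv kernels."""
    if not torch.cuda.is_available():
        return False
    # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= 8.9
    # (Ada/Hopper); the A10G on g5 runners reports (8, 6).
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class SiluMulQuantFp8Tests(unittest.TestCase):
    ...  # test_silu_mul_quant would live here

With a guard like this, the run would report one skip instead of compiling and failing every Hypothesis example with the same error.]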
[The same torch.OutOfMemoryError (GPU 0 has a total capacity of 22.07 GiB, almost all of it still held by allocations from earlier examples) was also raised for the following examples:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): tried to allocate 112.00 MiB, raised at moe/activation_test.py:95 (x_clamp)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False): tried to allocate 448.00 MiB, raised at moe/activation_test.py:92 (torch.randn)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): tried to allocate 56.00 MiB, raised at moe/activation_test.py:95 (x_clamp)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 56.00 MiB, raised at moe/activation_test.py:94 (x_sign)]
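[Note on the OutOfMemoryError cascade: these look like secondary failures. Each Hypothesis example allocates a fresh [T, 2 * D] bfloat16 tensor plus intermediates on the same device, and once roughly 21.6 of the 22.07 GiB are held, even a 56 MiB request fails. A sketch of an explicit per-example cleanup; release_cuda_memory() and where it is called are assumptions for illustration, not existing code in activation_test.py:

# Sketch only: release CUDA memory between Hypothesis examples so one
# large or failed example does not starve the next. Hypothetical helper.
import gc

import torch


def release_cuda_memory() -> None:
    # Collect dead Python references first so their CUDA tensors become
    # unreachable, then return the allocator's cached blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

Calling release_cuda_memory() at the top of the test body (or from a setUp hook) would bound the accumulation across examples. The error text's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets fragmentation rather than retained allocations, so it would likely help less here.]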
[The identical fp8e4nv CompilationError also recurred for:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)]

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = 
T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False

    [test body and Triton traceback as above]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.9116069Z 2025-05-07T20:33:07.9116522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.9116527Z 2025-05-07T20:33:07.9116631Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9116868Z self=, 2025-05-07T20:33:07.9116956Z T=2048, 2025-05-07T20:33:07.9117035Z D=7168, 2025-05-07T20:33:07.9117119Z scale_ub=1200.0, 2025-05-07T20:33:07.9117213Z contiguous=True, 2025-05-07T20:33:07.9117299Z compiled=False, 2025-05-07T20:33:07.9117375Z ) 2025-05-07T20:33:07.9117610Z self = 2025-05-07T20:33:07.9117798Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9117802Z 2025-05-07T20:33:07.9117894Z @given( 2025-05-07T20:33:07.9118020Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9118127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9118254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9118377Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9118493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9118580Z ) 2025-05-07T20:33:07.9118882Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9119013Z def test_silu_mul_quant( 2025-05-07T20:33:07.9119096Z self, 2025-05-07T20:33:07.9119174Z T: int, 2025-05-07T20:33:07.9119256Z D: int, 2025-05-07T20:33:07.9119356Z scale_ub: Optional[float], 2025-05-07T20:33:07.9119446Z contiguous: bool, 2025-05-07T20:33:07.9119575Z compiled: bool, 2025-05-07T20:33:07.9119654Z ) -> None: 2025-05-07T20:33:07.9119749Z torch.manual_seed(2025) 2025-05-07T20:33:07.9119826Z 2025-05-07T20:33:07.9120003Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9121976Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
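The allocator hint in this message can be tried from the test environment; a minimal sketch, assuming the variable is set before the process first touches CUDA, plus a hypothetical helper to drop cached blocks between Hypothesis examples:

import os

# Must be set before the first CUDA allocation to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cached_cuda_memory() -> None:
    # Hypothetical per-example teardown: return cached blocks to the
    # driver so each [T, 2 * D] input starts from a less fragmented pool.
    torch.cuda.synchronize()
    torch.cuda.empty_cache()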
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9121990Z 2025-05-07T20:33:07.9122107Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9122111Z 2025-05-07T20:33:07.9122218Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9122455Z self=, 2025-05-07T20:33:07.9122534Z T=1, 2025-05-07T20:33:07.9122620Z D=5120, 2025-05-07T20:33:07.9122712Z scale_ub=1200.0, 2025-05-07T20:33:07.9122815Z contiguous=True, 2025-05-07T20:33:07.9122917Z compiled=False, 2025-05-07T20:33:07.9123003Z ) 2025-05-07T20:33:07.9123231Z self = 2025-05-07T20:33:07.9123414Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9123419Z 2025-05-07T20:33:07.9123503Z @given( 2025-05-07T20:33:07.9123625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9123735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9123854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9123980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9124104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9124183Z ) 2025-05-07T20:33:07.9124451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9124590Z def test_silu_mul_quant( 2025-05-07T20:33:07.9124670Z self, 2025-05-07T20:33:07.9124754Z T: int, 2025-05-07T20:33:07.9124833Z D: int, 2025-05-07T20:33:07.9124931Z scale_ub: Optional[float], 2025-05-07T20:33:07.9125027Z contiguous: bool, 2025-05-07T20:33:07.9125111Z compiled: bool, 2025-05-07T20:33:07.9125192Z ) -> None: 2025-05-07T20:33:07.9125299Z torch.manual_seed(2025) 2025-05-07T20:33:07.9125379Z 2025-05-07T20:33:07.9125554Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9125639Z 2025-05-07T20:33:07.9125736Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9125873Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9125969Z x = x_sign * x_clamp 2025-05-07T20:33:07.9126053Z x0 = x[:, :D] 2025-05-07T20:33:07.9126143Z x1 = x[:, D:] 2025-05-07T20:33:07.9126217Z 2025-05-07T20:33:07.9126307Z if contiguous: 2025-05-07T20:33:07.9126412Z x0 = x0.contiguous() 2025-05-07T20:33:07.9126505Z x1 = x1.contiguous() 2025-05-07T20:33:07.9126596Z 2025-05-07T20:33:07.9131206Z if scale_ub is not None: 2025-05-07T20:33:07.9131329Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.9131551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.9131640Z ) 2025-05-07T20:33:07.9131761Z else: 2025-05-07T20:33:07.9131866Z scale_ub_tensor = None 2025-05-07T20:33:07.9131939Z 2025-05-07T20:33:07.9132077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.9132178Z op = silu_mul_quant 2025-05-07T20:33:07.9132266Z if compiled: 2025-05-07T20:33:07.9132368Z op = torch.compile(op) 2025-05-07T20:33:07.9132527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9132598Z 2025-05-07T20:33:07.9132691Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.9132696Z 2025-05-07T20:33:07.9132811Z moe/activation_test.py:117: 2025-05-07T20:33:07.9132952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9133067Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.9133172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9133738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.9133852Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.9134247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.9134486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.9134869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.9134968Z kernel = self.compile( 2025-05-07T20:33:07.9135399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.9135583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.9135718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9135723Z 2025-05-07T20:33:07.9135948Z self = 2025-05-07T20:33:07.9136804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.9137364Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b7ca3040>} 2025-05-07T20:33:07.9138231Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.9138434Z context = 2025-05-07T20:33:07.9138439Z 2025-05-07T20:33:07.9138621Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.9138902Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.9139026Z module_map=module_map) 2025-05-07T20:33:07.9139194Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.9139296Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.9139384Z E ^ 2025-05-07T20:33:07.9139771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.9139779Z 2025-05-07T20:33:07.9140241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.9140246Z 2025-05-07T20:33:07.9140351Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9140585Z self=, 2025-05-07T20:33:07.9140677Z T=2048, 2025-05-07T20:33:07.9140758Z D=5120, 2025-05-07T20:33:07.9140885Z scale_ub=None, 2025-05-07T20:33:07.9140983Z contiguous=True, 2025-05-07T20:33:07.9141110Z compiled=False, 2025-05-07T20:33:07.9141186Z ) 2025-05-07T20:33:07.9141428Z self = 2025-05-07T20:33:07.9141611Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9141615Z 2025-05-07T20:33:07.9141699Z @given( 2025-05-07T20:33:07.9141888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9141988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9142115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9142234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9142349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9142436Z ) 2025-05-07T20:33:07.9142697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9142798Z def test_silu_mul_quant( 2025-05-07T20:33:07.9142887Z self, 2025-05-07T20:33:07.9142974Z T: int, 2025-05-07T20:33:07.9143062Z D: int, 2025-05-07T20:33:07.9143165Z scale_ub: Optional[float], 2025-05-07T20:33:07.9143259Z contiguous: bool, 2025-05-07T20:33:07.9143357Z compiled: bool, 2025-05-07T20:33:07.9143444Z ) -> None: 2025-05-07T20:33:07.9143545Z torch.manual_seed(2025) 2025-05-07T20:33:07.9143632Z 2025-05-07T20:33:07.9143810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9143890Z 2025-05-07T20:33:07.9143993Z > x_sign = torch.sign(x) 2025-05-07T20:33:07.9145976Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9145985Z 2025-05-07T20:33:07.9146110Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:07.9146115Z 2025-05-07T20:33:07.9146220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9146466Z self=, 2025-05-07T20:33:07.9146546Z T=16384, 2025-05-07T20:33:07.9146623Z D=5120, 2025-05-07T20:33:07.9146758Z scale_ub=None, 2025-05-07T20:33:07.9146845Z contiguous=True, 2025-05-07T20:33:07.9146938Z compiled=False, 2025-05-07T20:33:07.9147013Z ) 2025-05-07T20:33:07.9147240Z self = 2025-05-07T20:33:07.9147427Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9147433Z 2025-05-07T20:33:07.9147510Z @given( 2025-05-07T20:33:07.9147632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9147743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9147862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9147987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9148103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9148183Z ) 2025-05-07T20:33:07.9148451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9148547Z def test_silu_mul_quant( 2025-05-07T20:33:07.9148625Z self, 2025-05-07T20:33:07.9148707Z T: int, 2025-05-07T20:33:07.9148784Z D: int, 2025-05-07T20:33:07.9148884Z scale_ub: Optional[float], 2025-05-07T20:33:07.9148978Z contiguous: bool, 2025-05-07T20:33:07.9149065Z compiled: bool, 2025-05-07T20:33:07.9149144Z ) -> None: 2025-05-07T20:33:07.9149293Z torch.manual_seed(2025) 2025-05-07T20:33:07.9149401Z 2025-05-07T20:33:07.9149573Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9151685Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9151733Z 2025-05-07T20:33:07.9151860Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9151864Z 2025-05-07T20:33:07.9151968Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9152201Z self=, 2025-05-07T20:33:07.9152288Z T=4096, 2025-05-07T20:33:07.9152366Z D=5120, 2025-05-07T20:33:07.9152447Z scale_ub=None, 2025-05-07T20:33:07.9152538Z contiguous=True, 2025-05-07T20:33:07.9152624Z compiled=False, 2025-05-07T20:33:07.9152699Z ) 2025-05-07T20:33:07.9152937Z self = 2025-05-07T20:33:07.9153120Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9153124Z 2025-05-07T20:33:07.9153206Z @given( 2025-05-07T20:33:07.9153327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9153426Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9153548Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9153666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9153779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9153864Z ) 2025-05-07T20:33:07.9154124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9154221Z def test_silu_mul_quant( 2025-05-07T20:33:07.9154305Z self, 2025-05-07T20:33:07.9154383Z T: int, 2025-05-07T20:33:07.9154466Z D: int, 2025-05-07T20:33:07.9154568Z scale_ub: Optional[float], 2025-05-07T20:33:07.9154657Z contiguous: bool, 2025-05-07T20:33:07.9154749Z compiled: bool, 2025-05-07T20:33:07.9154830Z ) -> None: 2025-05-07T20:33:07.9154923Z torch.manual_seed(2025) 2025-05-07T20:33:07.9155003Z 2025-05-07T20:33:07.9155218Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9157174Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9157189Z 2025-05-07T20:33:07.9157306Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9157312Z 2025-05-07T20:33:07.9157415Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9157656Z self=, 2025-05-07T20:33:07.9157737Z T=2048, 2025-05-07T20:33:07.9157831Z D=5120, 2025-05-07T20:33:07.9157918Z scale_ub=None, 2025-05-07T20:33:07.9158008Z contiguous=False, 2025-05-07T20:33:07.9158102Z compiled=False, 2025-05-07T20:33:07.9158179Z ) 2025-05-07T20:33:07.9158408Z self = 2025-05-07T20:33:07.9158637Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.9158676Z 2025-05-07T20:33:07.9158755Z @given( 2025-05-07T20:33:07.9158875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9158980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9159095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9159220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9159374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9159449Z ) 2025-05-07T20:33:07.9159722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9159821Z def test_silu_mul_quant( 2025-05-07T20:33:07.9159902Z self, 2025-05-07T20:33:07.9159990Z T: int, 2025-05-07T20:33:07.9160069Z D: int, 2025-05-07T20:33:07.9160171Z scale_ub: Optional[float], 2025-05-07T20:33:07.9160271Z contiguous: bool, 2025-05-07T20:33:07.9160363Z compiled: bool, 2025-05-07T20:33:07.9160443Z ) -> None: 2025-05-07T20:33:07.9160557Z torch.manual_seed(2025) 2025-05-07T20:33:07.9160635Z 2025-05-07T20:33:07.9160817Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9162774Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9162784Z 2025-05-07T20:33:07.9162927Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9162931Z 2025-05-07T20:33:07.9163059Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9163295Z self=, 2025-05-07T20:33:07.9163380Z T=4096, 2025-05-07T20:33:07.9163457Z D=7168, 2025-05-07T20:33:07.9163548Z scale_ub=None, 2025-05-07T20:33:07.9163639Z contiguous=True, 2025-05-07T20:33:07.9163725Z compiled=True, 2025-05-07T20:33:07.9163800Z ) 2025-05-07T20:33:07.9164041Z self = 2025-05-07T20:33:07.9164216Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.9164221Z 2025-05-07T20:33:07.9164345Z @given( 2025-05-07T20:33:07.9164464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9164563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9164684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9164801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9164918Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9165003Z ) 2025-05-07T20:33:07.9165262Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9165357Z def test_silu_mul_quant( 2025-05-07T20:33:07.9165441Z self, 2025-05-07T20:33:07.9165519Z T: int, 2025-05-07T20:33:07.9165601Z D: int, 2025-05-07T20:33:07.9165698Z scale_ub: Optional[float], 2025-05-07T20:33:07.9165789Z contiguous: bool, 2025-05-07T20:33:07.9165879Z compiled: bool, 2025-05-07T20:33:07.9165958Z ) -> None: 2025-05-07T20:33:07.9166054Z torch.manual_seed(2025) 2025-05-07T20:33:07.9166135Z 2025-05-07T20:33:07.9166308Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9168303Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9168366Z 2025-05-07T20:33:07.9168523Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9168528Z 2025-05-07T20:33:07.9168631Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9168872Z self=, 2025-05-07T20:33:07.9168956Z T=2048, 2025-05-07T20:33:07.9169043Z D=5120, 2025-05-07T20:33:07.9169129Z scale_ub=1200.0, 2025-05-07T20:33:07.9169222Z contiguous=False, 2025-05-07T20:33:07.9169317Z compiled=False, 2025-05-07T20:33:07.9169395Z ) 2025-05-07T20:33:07.9169627Z self = 2025-05-07T20:33:07.9169822Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.9169827Z 2025-05-07T20:33:07.9169907Z @given( 2025-05-07T20:33:07.9170030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9170138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9170256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9170386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9170507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9170586Z ) 2025-05-07T20:33:07.9170855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9170953Z def test_silu_mul_quant( 2025-05-07T20:33:07.9171033Z self, 2025-05-07T20:33:07.9171120Z T: int, 2025-05-07T20:33:07.9171199Z D: int, 2025-05-07T20:33:07.9171301Z scale_ub: Optional[float], 2025-05-07T20:33:07.9171402Z contiguous: bool, 2025-05-07T20:33:07.9171494Z compiled: bool, 2025-05-07T20:33:07.9171575Z ) -> None: 2025-05-07T20:33:07.9171680Z torch.manual_seed(2025) 2025-05-07T20:33:07.9171757Z 2025-05-07T20:33:07.9171941Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9173982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9173991Z 2025-05-07T20:33:07.9174118Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9174122Z 2025-05-07T20:33:07.9174226Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9174457Z self=, 2025-05-07T20:33:07.9174536Z T=4096, 2025-05-07T20:33:07.9174612Z D=7168, 2025-05-07T20:33:07.9174695Z scale_ub=1200.0, 2025-05-07T20:33:07.9174786Z contiguous=True, 2025-05-07T20:33:07.9174877Z compiled=False, 2025-05-07T20:33:07.9174951Z ) 2025-05-07T20:33:07.9175186Z self = 2025-05-07T20:33:07.9175368Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9175372Z 2025-05-07T20:33:07.9175461Z @given( 2025-05-07T20:33:07.9175582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9175684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9175807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9175972Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9176148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9176230Z ) 2025-05-07T20:33:07.9176491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9176586Z def test_silu_mul_quant( 2025-05-07T20:33:07.9176667Z self, 2025-05-07T20:33:07.9176742Z T: int, 2025-05-07T20:33:07.9176865Z D: int, 2025-05-07T20:33:07.9176965Z scale_ub: Optional[float], 2025-05-07T20:33:07.9177055Z contiguous: bool, 2025-05-07T20:33:07.9177147Z compiled: bool, 2025-05-07T20:33:07.9177233Z ) -> None: 2025-05-07T20:33:07.9177333Z torch.manual_seed(2025) 2025-05-07T20:33:07.9177418Z 2025-05-07T20:33:07.9177595Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9179559Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9179571Z 2025-05-07T20:33:07.9179688Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9179692Z 2025-05-07T20:33:07.9179795Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9180034Z self=, 2025-05-07T20:33:07.9180112Z T=16384, 2025-05-07T20:33:07.9180200Z D=7168, 2025-05-07T20:33:07.9180282Z scale_ub=None, 2025-05-07T20:33:07.9180369Z contiguous=False, 2025-05-07T20:33:07.9180459Z compiled=True, 2025-05-07T20:33:07.9180534Z ) 2025-05-07T20:33:07.9180758Z self = 2025-05-07T20:33:07.9180949Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.9180954Z 2025-05-07T20:33:07.9181037Z @given( 2025-05-07T20:33:07.9181158Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9181265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9181385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9181511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9181673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9181752Z ) 2025-05-07T20:33:07.9182017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9182108Z def test_silu_mul_quant( 2025-05-07T20:33:07.9182188Z self, 2025-05-07T20:33:07.9182273Z T: int, 2025-05-07T20:33:07.9182356Z D: int, 2025-05-07T20:33:07.9182457Z scale_ub: Optional[float], 2025-05-07T20:33:07.9182557Z contiguous: bool, 2025-05-07T20:33:07.9182646Z compiled: bool, 2025-05-07T20:33:07.9182726Z ) -> None: 2025-05-07T20:33:07.9183293Z torch.manual_seed(2025) 2025-05-07T20:33:07.9183393Z 2025-05-07T20:33:07.9183573Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9185707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9185714Z 2025-05-07T20:33:07.9185837Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9185903Z 2025-05-07T20:33:07.9186012Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9186246Z self=, 2025-05-07T20:33:07.9186329Z T=4096, 2025-05-07T20:33:07.9186410Z D=7168, 2025-05-07T20:33:07.9186497Z scale_ub=None, 2025-05-07T20:33:07.9186655Z contiguous=True, 2025-05-07T20:33:07.9186741Z compiled=False, 2025-05-07T20:33:07.9186814Z ) 2025-05-07T20:33:07.9187049Z self = 2025-05-07T20:33:07.9187226Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9187230Z 2025-05-07T20:33:07.9187310Z @given( 2025-05-07T20:33:07.9187426Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9187524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9187651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9187770Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9187881Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9187957Z ) 2025-05-07T20:33:07.9188212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9188312Z def test_silu_mul_quant( 2025-05-07T20:33:07.9188402Z self, 2025-05-07T20:33:07.9188482Z T: int, 2025-05-07T20:33:07.9188566Z D: int, 2025-05-07T20:33:07.9188665Z scale_ub: Optional[float], 2025-05-07T20:33:07.9188757Z contiguous: bool, 2025-05-07T20:33:07.9188851Z compiled: bool, 2025-05-07T20:33:07.9188931Z ) -> None: 2025-05-07T20:33:07.9189027Z torch.manual_seed(2025) 2025-05-07T20:33:07.9189106Z 2025-05-07T20:33:07.9189279Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9191339Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9191350Z 2025-05-07T20:33:07.9191465Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9191538Z 2025-05-07T20:33:07.9191644Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9191908Z self=, 2025-05-07T20:33:07.9192014Z T=16384, 2025-05-07T20:33:07.9192122Z D=7168, 2025-05-07T20:33:07.9192403Z scale_ub=None, 2025-05-07T20:33:07.9192713Z contiguous=True, 2025-05-07T20:33:07.9193072Z compiled=False, 2025-05-07T20:33:07.9193277Z ) 2025-05-07T20:33:07.9193603Z self = 2025-05-07T20:33:07.9194131Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.9194427Z 2025-05-07T20:33:07.9194504Z @given( 2025-05-07T20:33:07.9194739Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9195068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9195379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9195724Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9196064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9196374Z ) 2025-05-07T20:33:07.9196740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9197211Z def test_silu_mul_quant( 2025-05-07T20:33:07.9197618Z self, 2025-05-07T20:33:07.9197826Z T: int, 2025-05-07T20:33:07.9198061Z D: int, 2025-05-07T20:33:07.9198279Z scale_ub: Optional[float], 2025-05-07T20:33:07.9198555Z contiguous: bool, 2025-05-07T20:33:07.9198796Z compiled: bool, 2025-05-07T20:33:07.9199017Z ) -> None: 2025-05-07T20:33:07.9199236Z torch.manual_seed(2025) 2025-05-07T20:33:07.9199485Z 2025-05-07T20:33:07.9199757Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9202057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9204138Z 2025-05-07T20:33:07.9204253Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9204471Z 2025-05-07T20:33:07.9204581Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9205016Z self=, 2025-05-07T20:33:07.9205438Z T=16384, 2025-05-07T20:33:07.9205637Z D=7168, 2025-05-07T20:33:07.9205828Z scale_ub=1200.0, 2025-05-07T20:33:07.9206047Z contiguous=True, 2025-05-07T20:33:07.9206273Z compiled=False, 2025-05-07T20:33:07.9206477Z ) 2025-05-07T20:33:07.9206808Z self = 2025-05-07T20:33:07.9207334Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9207630Z 2025-05-07T20:33:07.9207713Z @given( 2025-05-07T20:33:07.9207938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9208270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9208595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9208948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9209287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9209591Z ) 2025-05-07T20:33:07.9209961Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9210425Z def test_silu_mul_quant( 2025-05-07T20:33:07.9210671Z self, 2025-05-07T20:33:07.9210862Z T: int, 2025-05-07T20:33:07.9211052Z D: int, 2025-05-07T20:33:07.9211317Z scale_ub: Optional[float], 2025-05-07T20:33:07.9211594Z contiguous: bool, 2025-05-07T20:33:07.9211831Z compiled: bool, 2025-05-07T20:33:07.9212051Z ) -> None: 2025-05-07T20:33:07.9212275Z torch.manual_seed(2025) 2025-05-07T20:33:07.9212521Z 2025-05-07T20:33:07.9212801Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9215045Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9217106Z 2025-05-07T20:33:07.9217225Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9217445Z 2025-05-07T20:33:07.9217556Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9217979Z self=, 2025-05-07T20:33:07.9218404Z T=128, 2025-05-07T20:33:07.9218635Z D=5120, 2025-05-07T20:33:07.9218823Z scale_ub=1200.0, 2025-05-07T20:33:07.9219091Z contiguous=False, 2025-05-07T20:33:07.9219318Z compiled=False, 2025-05-07T20:33:07.9219513Z ) 2025-05-07T20:33:07.9219847Z self = 2025-05-07T20:33:07.9220377Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.9220666Z 2025-05-07T20:33:07.9220792Z @given( 2025-05-07T20:33:07.9221015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9221334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9221656Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9221990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9222333Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9222628Z ) 2025-05-07T20:33:07.9222981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9223453Z def test_silu_mul_quant( 2025-05-07T20:33:07.9223699Z self, 2025-05-07T20:33:07.9223896Z T: int, 2025-05-07T20:33:07.9224085Z D: int, 2025-05-07T20:33:07.9224305Z scale_ub: Optional[float], 2025-05-07T20:33:07.9224585Z contiguous: bool, 2025-05-07T20:33:07.9224822Z compiled: bool, 2025-05-07T20:33:07.9225054Z ) -> None: 2025-05-07T20:33:07.9225277Z torch.manual_seed(2025) 2025-05-07T20:33:07.9225529Z 2025-05-07T20:33:07.9225810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9226166Z 2025-05-07T20:33:07.9226356Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9226654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9226976Z x = x_sign * x_clamp 2025-05-07T20:33:07.9227223Z x0 = x[:, :D] 2025-05-07T20:33:07.9227441Z x1 = x[:, D:] 2025-05-07T20:33:07.9227654Z 2025-05-07T20:33:07.9227835Z if contiguous: 2025-05-07T20:33:07.9228078Z x0 = x0.contiguous() 2025-05-07T20:33:07.9228354Z x1 = x1.contiguous() 2025-05-07T20:33:07.9228599Z 2025-05-07T20:33:07.9228794Z if scale_ub is not None: 2025-05-07T20:33:07.9229075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.9229421Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.9229735Z ) 2025-05-07T20:33:07.9230072Z else: 2025-05-07T20:33:07.9230287Z scale_ub_tensor = None 2025-05-07T20:33:07.9230545Z 2025-05-07T20:33:07.9230776Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.9231153Z op = silu_mul_quant 2025-05-07T20:33:07.9231403Z if compiled: 2025-05-07T20:33:07.9231654Z op = torch.compile(op) 2025-05-07T20:33:07.9231961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9232238Z 2025-05-07T20:33:07.9232429Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.9232601Z 2025-05-07T20:33:07.9232703Z moe/activation_test.py:117: 2025-05-07T20:33:07.9233006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9233353Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.9233639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9234380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.9235123Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.9235699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.9236438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.9237143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.9237711Z kernel = self.compile( 2025-05-07T20:33:07.9238362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.9239109Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.9239517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9239771Z 2025-05-07T20:33:07.9239988Z self = 2025-05-07T20:33:07.9241210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.9242720Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b79c5ca0>} 2025-05-07T20:33:07.9244190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.9245295Z context = 2025-05-07T20:33:07.9245606Z 2025-05-07T20:33:07.9245777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.9246330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.9246821Z module_map=module_map) 2025-05-07T20:33:07.9247199Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.9247567Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.9247833Z E ^ 2025-05-07T20:33:07.9248324Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.9248820Z 2025-05-07T20:33:07.9249271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.9249830Z 2025-05-07T20:33:07.9249940Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9250372Z self=, 2025-05-07T20:33:07.9250790Z T=2048, 2025-05-07T20:33:07.9250978Z D=7168, 2025-05-07T20:33:07.9251168Z scale_ub=None, 2025-05-07T20:33:07.9251381Z contiguous=False, 2025-05-07T20:33:07.9251609Z compiled=False, 2025-05-07T20:33:07.9251810Z ) 2025-05-07T20:33:07.9252129Z self = 2025-05-07T20:33:07.9252712Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.9253047Z 2025-05-07T20:33:07.9253127Z @given( 2025-05-07T20:33:07.9253352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9253676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9253993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9254333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9254669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9254960Z ) 2025-05-07T20:33:07.9255330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9255790Z def test_silu_mul_quant( 2025-05-07T20:33:07.9256036Z self, 2025-05-07T20:33:07.9256231Z T: int, 2025-05-07T20:33:07.9256425Z D: int, 2025-05-07T20:33:07.9256641Z scale_ub: Optional[float], 2025-05-07T20:33:07.9256915Z contiguous: bool, 2025-05-07T20:33:07.9257150Z compiled: bool, 2025-05-07T20:33:07.9257371Z ) -> None: 2025-05-07T20:33:07.9257583Z torch.manual_seed(2025) 2025-05-07T20:33:07.9257819Z 2025-05-07T20:33:07.9258101Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9260394Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
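Note that across these reports the pool PyTorch holds keeps creeping up (21.69 GiB, then 21.73 GiB, then 21.74 GiB allocated) while free memory shrinks, which points at tensors from earlier failed examples being retained rather than at any single oversized allocation. A small diagnostic sketch, with log_cuda_mem as a hypothetical helper one could call between Hypothesis examples:

import torch

def log_cuda_mem(tag: str) -> None:
    # Print allocator counters to spot memory retained across examples.
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated() / mib
    reserved = torch.cuda.memory_reserved() / mib
    print(f"{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")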
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9262528Z 2025-05-07T20:33:07.9262650Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9262868Z 2025-05-07T20:33:07.9262976Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9263395Z self=, 2025-05-07T20:33:07.9263821Z T=128, 2025-05-07T20:33:07.9264009Z D=7168, 2025-05-07T20:33:07.9264192Z scale_ub=1200.0, 2025-05-07T20:33:07.9264423Z contiguous=True, 2025-05-07T20:33:07.9264655Z compiled=True, 2025-05-07T20:33:07.9264862Z ) 2025-05-07T20:33:07.9265201Z self = 2025-05-07T20:33:07.9265727Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.9266013Z 2025-05-07T20:33:07.9266100Z @given( 2025-05-07T20:33:07.9266322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9266651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9266967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9267301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9267643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9267942Z ) 2025-05-07T20:33:07.9268306Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9268776Z def test_silu_mul_quant( 2025-05-07T20:33:07.9269019Z self, 2025-05-07T20:33:07.9269220Z T: int, 2025-05-07T20:33:07.9269410Z D: int, 2025-05-07T20:33:07.9269631Z scale_ub: Optional[float], 2025-05-07T20:33:07.9270005Z contiguous: bool, 2025-05-07T20:33:07.9270240Z compiled: bool, 2025-05-07T20:33:07.9270467Z ) -> None: 2025-05-07T20:33:07.9270690Z torch.manual_seed(2025) 2025-05-07T20:33:07.9270936Z 2025-05-07T20:33:07.9271224Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9277119Z 2025-05-07T20:33:07.9277340Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9277659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9278101Z x = x_sign * x_clamp 2025-05-07T20:33:07.9278370Z x0 = x[:, :D] 2025-05-07T20:33:07.9278594Z x1 = x[:, D:] 2025-05-07T20:33:07.9278823Z 2025-05-07T20:33:07.9279032Z if contiguous: 2025-05-07T20:33:07.9279275Z x0 = x0.contiguous() 2025-05-07T20:33:07.9279558Z x1 = x1.contiguous() 2025-05-07T20:33:07.9279820Z 2025-05-07T20:33:07.9280027Z if scale_ub is not None: 2025-05-07T20:33:07.9280327Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.9280691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.9281016Z ) 2025-05-07T20:33:07.9281224Z else: 2025-05-07T20:33:07.9281448Z scale_ub_tensor = None 2025-05-07T20:33:07.9281710Z 2025-05-07T20:33:07.9281958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.9282301Z op = silu_mul_quant 2025-05-07T20:33:07.9282573Z if compiled: 2025-05-07T20:33:07.9283318Z op = torch.compile(op) 2025-05-07T20:33:07.9283672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9283970Z 2025-05-07T20:33:07.9284164Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.9284346Z 2025-05-07T20:33:07.9284450Z moe/activation_test.py:117: 2025-05-07T20:33:07.9284918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9285273Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.9285637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.9286244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.9286857Z return fn(*args, **kwargs) 2025-05-07T20:33:07.9287569Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.9288393Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.9288977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.9289712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.9290428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.9291000Z kernel = self.compile( 2025-05-07T20:33:07.9291578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.9292280Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.9292701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.9292979Z 2025-05-07T20:33:07.9293219Z self = 2025-05-07T20:33:07.9294406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.9295940Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd1b793e0d0>} 2025-05-07T20:33:07.9297416Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.9298537Z context = 2025-05-07T20:33:07.9298857Z 2025-05-07T20:33:07.9299027Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.9299581Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.9300070Z module_map=module_map) 2025-05-07T20:33:07.9300521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.9300898Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.9301170Z E ^ 2025-05-07T20:33:07.9301671Z E ValueError("type fp8e4nv not supported in this architecture. 
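This occurrence is reached through the torch.compile path (the eval_frame.py frame above), but Dynamo simply re-enters the same Triton JIT, so the failure is identical to the eager case. If running on pre-sm_89 GPUs were a goal, the fp8 flavor would have to be chosen host-side before the kernel launch; a speculative sketch (pick_fp8_dtype is a hypothetical helper, and whether FBGEMM's kernels accept E5M2 here is an assumption, not something this log confirms):

import torch

def pick_fp8_dtype() -> torch.dtype:
    # Triton reports only 'fp8e4b15' and 'fp8e5' on this architecture,
    # so fall back to E5M2 when E4M3 (fp8e4nv) is unavailable.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2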
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.9302177Z 2025-05-07T20:33:07.9302636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.9303244Z 2025-05-07T20:33:07.9303358Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9303789Z self=, 2025-05-07T20:33:07.9304205Z T=128, 2025-05-07T20:33:07.9304394Z D=7168, 2025-05-07T20:33:07.9304591Z scale_ub=1200.0, 2025-05-07T20:33:07.9304807Z contiguous=True, 2025-05-07T20:33:07.9305030Z compiled=False, 2025-05-07T20:33:07.9305237Z ) 2025-05-07T20:33:07.9305559Z self = 2025-05-07T20:33:07.9306076Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.9306361Z 2025-05-07T20:33:07.9306443Z @given( 2025-05-07T20:33:07.9306667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9307042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9307360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9307732Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9308071Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9308367Z ) 2025-05-07T20:33:07.9308738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9309197Z def test_silu_mul_quant( 2025-05-07T20:33:07.9309486Z self, 2025-05-07T20:33:07.9309683Z T: int, 2025-05-07T20:33:07.9310020Z D: int, 2025-05-07T20:33:07.9310241Z scale_ub: Optional[float], 2025-05-07T20:33:07.9310521Z contiguous: bool, 2025-05-07T20:33:07.9310762Z compiled: bool, 2025-05-07T20:33:07.9310991Z ) -> None: 2025-05-07T20:33:07.9311213Z torch.manual_seed(2025) 2025-05-07T20:33:07.9311453Z 2025-05-07T20:33:07.9311732Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9312096Z 2025-05-07T20:33:07.9312286Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9312594Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9314859Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9316917Z 2025-05-07T20:33:07.9317033Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.9317254Z 2025-05-07T20:33:07.9317365Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9317790Z self=, 2025-05-07T20:33:07.9318222Z T=128, 2025-05-07T20:33:07.9318418Z D=5120, 2025-05-07T20:33:07.9318610Z scale_ub=1200.0, 2025-05-07T20:33:07.9318844Z contiguous=True, 2025-05-07T20:33:07.9319073Z compiled=True, 2025-05-07T20:33:07.9319274Z ) 2025-05-07T20:33:07.9319611Z self = 2025-05-07T20:33:07.9320135Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.9320423Z 2025-05-07T20:33:07.9320505Z @given( 2025-05-07T20:33:07.9320731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9321143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9321460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9321792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9322130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9322425Z ) 2025-05-07T20:33:07.9322784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9323260Z def test_silu_mul_quant( 2025-05-07T20:33:07.9323505Z self, 2025-05-07T20:33:07.9323698Z T: int, 2025-05-07T20:33:07.9323891Z D: int, 2025-05-07T20:33:07.9324111Z scale_ub: Optional[float], 2025-05-07T20:33:07.9324390Z contiguous: bool, 2025-05-07T20:33:07.9324633Z compiled: bool, 2025-05-07T20:33:07.9324866Z ) -> None: 2025-05-07T20:33:07.9325088Z torch.manual_seed(2025) 2025-05-07T20:33:07.9325333Z 2025-05-07T20:33:07.9325619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9325984Z 2025-05-07T20:33:07.9326178Z x_sign = torch.sign(x) 2025-05-07T20:33:07.9326481Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.9328717Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9330849Z 2025-05-07T20:33:07.9330968Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.9331189Z 2025-05-07T20:33:07.9331300Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.9331725Z self=, 2025-05-07T20:33:07.9332146Z T=128, 2025-05-07T20:33:07.9332337Z D=7168, 2025-05-07T20:33:07.9332524Z scale_ub=None, 2025-05-07T20:33:07.9332739Z contiguous=True, 2025-05-07T20:33:07.9332961Z compiled=True, 2025-05-07T20:33:07.9333154Z ) 2025-05-07T20:33:07.9333485Z self = 2025-05-07T20:33:07.9334004Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.9334288Z 2025-05-07T20:33:07.9334373Z @given( 2025-05-07T20:33:07.9334601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.9334927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.9335246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.9335580Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.9335923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.9336220Z ) 2025-05-07T20:33:07.9336583Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.9337055Z def test_silu_mul_quant( 2025-05-07T20:33:07.9337301Z self, 2025-05-07T20:33:07.9337486Z T: int, 2025-05-07T20:33:07.9337690Z D: int, 2025-05-07T20:33:07.9337917Z scale_ub: Optional[float], 2025-05-07T20:33:07.9338196Z contiguous: bool, 2025-05-07T20:33:07.9338441Z compiled: bool, 2025-05-07T20:33:07.9338669Z ) -> None: 2025-05-07T20:33:07.9338883Z torch.manual_seed(2025) 2025-05-07T20:33:07.9339133Z 2025-05-07T20:33:07.9339414Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.9341712Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.9343806Z 2025-05-07T20:33:07.9343930Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.9344281Z =============================== warnings summary =============================== 2025-05-07T20:33:07.9344855Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:07.9345597Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:07.9346341Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:07.9347726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:07.9349028Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:07.9349439Z 2025-05-07T20:33:07.9349661Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:07.9350342Z ================= 1 failed, 1 deselected, 3 warnings in 19.35s ================= 2025-05-07T20:33:09.4854629Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:09.5494604Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:09.5494867Z 2025-05-07T20:33:11.5513144Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:13.7047227Z ============================= test session starts ============================== 2025-05-07T20:33:13.7047950Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:13.7048503Z cachedir: .pytest_cache 2025-05-07T20:33:13.7049121Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:13.7049897Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:13.7050322Z plugins: hypothesis-6.131.14 2025-05-07T20:33:15.3321410Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:15.5451918Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:15.5452339Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:15.5452581Z 2025-05-07T20:33:18.2535020Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.2537011Z self=, 2025-05-07T20:33:18.2537958Z T=1, 2025-05-07T20:33:18.2538344Z D=5120, 2025-05-07T20:33:18.2538752Z scale_ub=None, 2025-05-07T20:33:18.2539202Z contiguous=True, 2025-05-07T20:33:18.2539653Z compiled=True, 2025-05-07T20:33:18.2540060Z ) 2025-05-07T20:33:18.2540725Z self = 2025-05-07T20:33:18.2541776Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:18.2542339Z 2025-05-07T20:33:18.2542507Z @given( 2025-05-07T20:33:18.2542979Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.2543640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.2544719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.2545274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.2545665Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.2545973Z ) 2025-05-07T20:33:18.2546343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.2546835Z def test_silu_mul_quant( 2025-05-07T20:33:18.2547095Z self, 2025-05-07T20:33:18.2547301Z T: int, 2025-05-07T20:33:18.2547503Z D: int, 2025-05-07T20:33:18.2547730Z scale_ub: Optional[float], 2025-05-07T20:33:18.2548017Z contiguous: bool, 2025-05-07T20:33:18.2548263Z compiled: bool, 2025-05-07T20:33:18.2548502Z ) -> None: 2025-05-07T20:33:18.2548725Z torch.manual_seed(2025) 2025-05-07T20:33:18.2548978Z 2025-05-07T20:33:18.2549263Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.2549660Z 2025-05-07T20:33:18.2550073Z x_sign = torch.sign(x) 2025-05-07T20:33:18.2550374Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:18.2550707Z x = x_sign * x_clamp 2025-05-07T20:33:18.2550957Z x0 = x[:, :D] 2025-05-07T20:33:18.2551176Z x1 = x[:, D:] 2025-05-07T20:33:18.2551388Z 2025-05-07T20:33:18.2551581Z if contiguous: 2025-05-07T20:33:18.2551922Z x0 = x0.contiguous() 2025-05-07T20:33:18.2552204Z x1 = x1.contiguous() 2025-05-07T20:33:18.2552552Z 2025-05-07T20:33:18.2552749Z if scale_ub is not None: 2025-05-07T20:33:18.2553045Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.2553405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.2553741Z ) 2025-05-07T20:33:18.2553932Z else: 2025-05-07T20:33:18.2554229Z scale_ub_tensor = None 2025-05-07T20:33:18.2554500Z 2025-05-07T20:33:18.2554732Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.2555070Z op = silu_mul_quant 2025-05-07T20:33:18.2555333Z if compiled: 2025-05-07T20:33:18.2555589Z op = torch.compile(op) 2025-05-07T20:33:18.2555903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.2556198Z 2025-05-07T20:33:18.2556392Z y_fp8, y_scale = fn() 2025-05-07T20:33:18.2556694Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:18.2557005Z 2025-05-07T20:33:18.2557244Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.2557602Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:18.2557910Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:18.2558238Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:18.2558621Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:18.2558961Z 2025-05-07T20:33:18.2559167Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:18.2559374Z 2025-05-07T20:33:18.2559476Z moe/activation_test.py:126: 2025-05-07T20:33:18.2559795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.2560157Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:18.2560494Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:18.2561355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:18.2562186Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:18.2562777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.2563512Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.2564260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:18.2565099Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:18.2565962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:18.2566769Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:18.2567555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:18.2568241Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:18.2568872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:18.2569424Z fn() 2025-05-07T20:33:18.2569960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:18.2570577Z self.fn.run( 
2025-05-07T20:33:18.2571062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.2571629Z kernel = self.compile( 2025-05-07T20:33:18.2572201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.2572892Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.2573365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.2573653Z 2025-05-07T20:33:18.2573866Z self = 2025-05-07T20:33:18.2575047Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.2576621Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f59015d99d0>} 2025-05-07T20:33:18.2578089Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.2579201Z context = 2025-05-07T20:33:18.2579505Z 2025-05-07T20:33:18.2579683Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.2580236Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.2580725Z module_map=module_map) 2025-05-07T20:33:18.2581105Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.2581472Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:18.2581739Z E ^ 2025-05-07T20:33:18.2582232Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.2582722Z 2025-05-07T20:33:18.2583599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.2584156Z 2025-05-07T20:33:18.2584264Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.2584693Z self=, 2025-05-07T20:33:18.2585114Z T=2048, 2025-05-07T20:33:18.2585306Z D=5120, 2025-05-07T20:33:18.2585490Z scale_ub=1200.0, 2025-05-07T20:33:18.2585710Z contiguous=True, 2025-05-07T20:33:18.2585934Z compiled=False, 2025-05-07T20:33:18.2586131Z ) 2025-05-07T20:33:19.7634156Z self = 2025-05-07T20:33:19.7634971Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.7635369Z 2025-05-07T20:33:19.7635460Z @given( 2025-05-07T20:33:19.7635700Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.7636320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.7636651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.7637005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.7637348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.7637654Z ) 2025-05-07T20:33:19.7638036Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.7638519Z def test_silu_mul_quant( 2025-05-07T20:33:19.7638775Z self, 2025-05-07T20:33:19.7638976Z T: int, 2025-05-07T20:33:19.7639176Z D: int, 2025-05-07T20:33:19.7639406Z scale_ub: Optional[float], 2025-05-07T20:33:19.7639693Z contiguous: bool, 2025-05-07T20:33:19.7639937Z compiled: bool, 2025-05-07T20:33:19.7640184Z ) -> None: 2025-05-07T20:33:19.7640405Z torch.manual_seed(2025) 2025-05-07T20:33:19.7640656Z 2025-05-07T20:33:19.7640939Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.7641307Z 
2025-05-07T20:33:19.7641503Z x_sign = torch.sign(x) 2025-05-07T20:33:19.7641794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.7642122Z x = x_sign * x_clamp 2025-05-07T20:33:19.7642370Z x0 = x[:, :D] 2025-05-07T20:33:19.7642585Z x1 = x[:, D:] 2025-05-07T20:33:19.7642920Z 2025-05-07T20:33:19.7643118Z if contiguous: 2025-05-07T20:33:19.7643439Z x0 = x0.contiguous() 2025-05-07T20:33:19.7643714Z x1 = x1.contiguous() 2025-05-07T20:33:19.7643975Z 2025-05-07T20:33:19.7644168Z if scale_ub is not None: 2025-05-07T20:33:19.7644455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.7644812Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.7645218Z ) 2025-05-07T20:33:19.7645420Z else: 2025-05-07T20:33:19.7645641Z scale_ub_tensor = None 2025-05-07T20:33:19.7645906Z 2025-05-07T20:33:19.7646152Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.7646491Z op = silu_mul_quant 2025-05-07T20:33:19.7646758Z if compiled: 2025-05-07T20:33:19.7647015Z op = torch.compile(op) 2025-05-07T20:33:19.7647330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.7647626Z 2025-05-07T20:33:19.7647818Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.7648001Z 2025-05-07T20:33:19.7648103Z moe/activation_test.py:117: 2025-05-07T20:33:19.7648414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.7648758Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.7649047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.7649797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.7650555Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.7651127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.7651867Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.7652583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.7653151Z kernel = self.compile( 2025-05-07T20:33:19.7653730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.7654438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.7654857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.7655099Z 2025-05-07T20:33:19.7655318Z self = 2025-05-07T20:33:19.7656545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.7658071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58dd9ad5e0>} 2025-05-07T20:33:19.7659542Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.7660649Z context = 2025-05-07T20:33:19.7660963Z 2025-05-07T20:33:19.7667450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.7668069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.7668594Z module_map=module_map) 2025-05-07T20:33:19.7668997Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.7669371Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.7669650Z E ^ 2025-05-07T20:33:19.7670312Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.7670879Z 2025-05-07T20:33:19.7671349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.7671956Z 2025-05-07T20:33:19.7672060Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.7672495Z self=, 2025-05-07T20:33:19.7672923Z T=2048, 2025-05-07T20:33:19.7673110Z D=5120, 2025-05-07T20:33:19.7673357Z scale_ub=1200.0, 2025-05-07T20:33:19.7673592Z contiguous=True, 2025-05-07T20:33:19.7673813Z compiled=True, 2025-05-07T20:33:19.7674027Z ) 2025-05-07T20:33:19.7674368Z self = 2025-05-07T20:33:19.7674898Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.7675188Z 2025-05-07T20:33:19.7675269Z @given( 2025-05-07T20:33:19.7675507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.7675842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.7676156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.7676504Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.7676852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.7677153Z ) 2025-05-07T20:33:19.7677523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.7677996Z def test_silu_mul_quant( 2025-05-07T20:33:19.7678252Z self, 2025-05-07T20:33:19.7678445Z T: int, 2025-05-07T20:33:19.7678650Z D: int, 2025-05-07T20:33:19.7678875Z scale_ub: Optional[float], 2025-05-07T20:33:19.7679151Z contiguous: bool, 2025-05-07T20:33:19.7679401Z compiled: bool, 2025-05-07T20:33:19.7679631Z ) -> None: 2025-05-07T20:33:19.7679849Z torch.manual_seed(2025) 2025-05-07T20:33:19.7680099Z 2025-05-07T20:33:19.7680380Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.7680732Z 2025-05-07T20:33:19.7680928Z x_sign = torch.sign(x) 2025-05-07T20:33:19.7681227Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.7681552Z x = x_sign * x_clamp 2025-05-07T20:33:19.7681793Z x0 = x[:, :D] 2025-05-07T20:33:19.7682016Z x1 = x[:, D:] 2025-05-07T20:33:19.7682231Z 2025-05-07T20:33:19.7682416Z if contiguous: 2025-05-07T20:33:19.7682653Z x0 = x0.contiguous() 2025-05-07T20:33:19.7683197Z x1 = x1.contiguous() 2025-05-07T20:33:19.7683440Z 2025-05-07T20:33:19.7683632Z if scale_ub is not None: 2025-05-07T20:33:19.7684001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.7684352Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.7684670Z ) 2025-05-07T20:33:19.7684860Z else: 2025-05-07T20:33:19.7685071Z scale_ub_tensor = None 2025-05-07T20:33:19.7685320Z 2025-05-07T20:33:19.7685559Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.7685893Z op = silu_mul_quant 2025-05-07T20:33:19.7686143Z if compiled: 
2025-05-07T20:33:19.7686396Z op = torch.compile(op) 2025-05-07T20:33:19.7686703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.7686982Z 2025-05-07T20:33:19.7687175Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.7687466Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.7687765Z 2025-05-07T20:33:19.7688004Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.7688356Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.7688658Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.7688978Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.7689356Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.7689684Z 2025-05-07T20:33:19.7689957Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.7690172Z 2025-05-07T20:33:19.7690332Z moe/activation_test.py:126: 2025-05-07T20:33:19.7690645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.7690991Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.7691331Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.7692180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.7693068Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.7693646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.7694384Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.7695127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.7695909Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.7696715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.7697521Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.7698311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.7698994Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.7699639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.7700197Z fn() 2025-05-07T20:33:19.7700735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.7701350Z self.fn.run( 2025-05-07T20:33:19.7701841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.7702410Z kernel = self.compile( 2025-05-07T20:33:19.7702977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.7703674Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.7704085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.7704328Z 2025-05-07T20:33:19.7704544Z self = 2025-05-07T20:33:19.7705756Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:19.7707266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f59000565e0>} 2025-05-07T20:33:19.7708736Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.7709923Z context = 2025-05-07T20:33:19.7710231Z 2025-05-07T20:33:19.7710407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.7710953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.7711447Z module_map=module_map) 2025-05-07T20:33:19.7711822Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.7712178Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.7712447Z E ^ 2025-05-07T20:33:19.7712983Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.7713543Z 2025-05-07T20:33:19.7713999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.7714553Z 2025-05-07T20:33:19.7714650Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.7715079Z self=, 2025-05-07T20:33:19.7715544Z T=16384, 2025-05-07T20:33:19.7715730Z D=7168, 2025-05-07T20:33:19.7715919Z scale_ub=1200.0, 2025-05-07T20:33:19.7716142Z contiguous=False, 2025-05-07T20:33:19.7716364Z compiled=False, 2025-05-07T20:33:19.7716566Z ) 2025-05-07T20:33:21.1012400Z self = 2025-05-07T20:33:21.1013989Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:21.1014835Z 2025-05-07T20:33:21.1015009Z @given( 2025-05-07T20:33:21.1015516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.1016057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.1016429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.1016781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.1017136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.1017435Z ) 2025-05-07T20:33:21.1017814Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.1018290Z def test_silu_mul_quant( 2025-05-07T20:33:21.1018541Z self, 2025-05-07T20:33:21.1018748Z T: int, 2025-05-07T20:33:21.1018961Z D: int, 2025-05-07T20:33:21.1019183Z scale_ub: Optional[float], 2025-05-07T20:33:21.1019474Z contiguous: bool, 2025-05-07T20:33:21.1019726Z compiled: bool, 2025-05-07T20:33:21.1019961Z ) -> None: 2025-05-07T20:33:21.1020188Z torch.manual_seed(2025) 2025-05-07T20:33:21.1020448Z 2025-05-07T20:33:21.1020726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.1021102Z 2025-05-07T20:33:21.1021306Z x_sign = torch.sign(x) 2025-05-07T20:33:21.1021613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.1021938Z x = x_sign * x_clamp 2025-05-07T20:33:21.1022195Z x0 = x[:, :D] 2025-05-07T20:33:21.1022425Z x1 = x[:, D:] 2025-05-07T20:33:21.1022641Z 2025-05-07T20:33:21.1022837Z if contiguous: 2025-05-07T20:33:21.1023080Z x0 = x0.contiguous() 2025-05-07T20:33:21.1023347Z x1 = x1.contiguous() 2025-05-07T20:33:21.1023912Z 2025-05-07T20:33:21.1024116Z if scale_ub is not None: 2025-05-07T20:33:21.1024400Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.1024756Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.1025091Z ) 2025-05-07T20:33:21.1025289Z else: 2025-05-07T20:33:21.1025511Z scale_ub_tensor = None 2025-05-07T20:33:21.1025782Z 2025-05-07T20:33:21.1026016Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:33:21.1026349Z op = silu_mul_quant 2025-05-07T20:33:21.1026609Z if compiled: 2025-05-07T20:33:21.1026861Z op = torch.compile(op) 2025-05-07T20:33:21.1027175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.1027470Z 2025-05-07T20:33:21.1027670Z > y_fp8, y_scale = fn() 2025-05-07T20:33:21.1027842Z 2025-05-07T20:33:21.1027948Z moe/activation_test.py:117: 2025-05-07T20:33:21.1028264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.1028615Z moe/activation_test.py:115: in fn 2025-05-07T20:33:21.1028900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.1029645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:21.1030658Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:21.1031325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.1032057Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.1032775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.1033426Z kernel = self.compile( 2025-05-07T20:33:21.1033999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.1034703Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.1035120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.1035365Z 2025-05-07T20:33:21.1035587Z self = 2025-05-07T20:33:21.1036761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.1038290Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fffea280>} 2025-05-07T20:33:21.1039769Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.1040877Z context = 2025-05-07T20:33:21.1041180Z 2025-05-07T20:33:21.1041356Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.1041901Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.1042407Z module_map=module_map) 2025-05-07T20:33:21.1042785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.1043142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:21.1043411Z E ^ 2025-05-07T20:33:21.1043915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.1044407Z 2025-05-07T20:33:21.1044861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.1045420Z 2025-05-07T20:33:21.1045568Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.1045999Z self=, 2025-05-07T20:33:21.1046425Z T=1, 2025-05-07T20:33:21.1046604Z D=7168, 2025-05-07T20:33:21.1046804Z scale_ub=None, 2025-05-07T20:33:21.1047020Z contiguous=True, 2025-05-07T20:33:21.1047237Z compiled=True, 2025-05-07T20:33:21.1047448Z ) 2025-05-07T20:33:21.1047778Z self = 2025-05-07T20:33:21.1048289Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:21.1048559Z 2025-05-07T20:33:21.1048632Z @given( 2025-05-07T20:33:21.1048863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.1049188Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.1049500Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.1049837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.1050179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.1050466Z ) 2025-05-07T20:33:21.1050822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.1051286Z def test_silu_mul_quant( 2025-05-07T20:33:21.1051527Z self, 2025-05-07T20:33:21.1051719Z T: int, 2025-05-07T20:33:21.1051969Z D: int, 2025-05-07T20:33:21.1052188Z scale_ub: Optional[float], 2025-05-07T20:33:21.1052496Z contiguous: bool, 2025-05-07T20:33:21.1052739Z compiled: bool, 2025-05-07T20:33:21.1052960Z ) -> None: 2025-05-07T20:33:21.1053171Z torch.manual_seed(2025) 2025-05-07T20:33:21.1053421Z 2025-05-07T20:33:21.1053695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.1054086Z 2025-05-07T20:33:21.1054280Z x_sign = torch.sign(x) 2025-05-07T20:33:21.1054574Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.1054890Z x = x_sign * x_clamp 2025-05-07T20:33:21.1055134Z x0 = x[:, :D] 2025-05-07T20:33:21.1055353Z x1 = x[:, D:] 2025-05-07T20:33:21.1055557Z 2025-05-07T20:33:21.1055750Z if contiguous: 2025-05-07T20:33:21.1055989Z x0 = x0.contiguous() 2025-05-07T20:33:21.1056245Z x1 = x1.contiguous() 2025-05-07T20:33:21.1056494Z 2025-05-07T20:33:21.1056687Z if scale_ub is not None: 2025-05-07T20:33:21.1056960Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.1057308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.1057627Z ) 2025-05-07T20:33:21.1057819Z else: 2025-05-07T20:33:21.1058022Z scale_ub_tensor = None 2025-05-07T20:33:21.1058282Z 2025-05-07T20:33:21.1058522Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.1058838Z op = silu_mul_quant 2025-05-07T20:33:21.1059092Z if compiled: 2025-05-07T20:33:21.1059342Z op = torch.compile(op) 2025-05-07T20:33:21.1059643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.1059932Z 2025-05-07T20:33:21.1060129Z y_fp8, y_scale = fn() 2025-05-07T20:33:21.1060414Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:21.1060718Z 2025-05-07T20:33:21.1060962Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.1061300Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:21.1061607Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:21.1061930Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:21.1062302Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:21.1062619Z 2025-05-07T20:33:21.1062819Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:21.1063021Z 2025-05-07T20:33:21.1063124Z moe/activation_test.py:126: 2025-05-07T20:33:21.1063423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.1063826Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:21.1064161Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:21.1065000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:21.1065821Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:21.1066450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.1067188Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.1067920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:21.1068696Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:21.1069511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:21.1070415Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:21.1071201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:21.1071944Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:21.1072592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:21.1073194Z fn() 2025-05-07T20:33:21.1073725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:21.1074357Z self.fn.run( 2025-05-07T20:33:21.1074856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.1075463Z kernel = self.compile( 2025-05-07T20:33:21.1076044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.1076747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.1077163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.1077405Z 2025-05-07T20:33:21.1077619Z self = 2025-05-07T20:33:21.1078799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.1080313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58fffff940>} 2025-05-07T20:33:21.1081796Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.1083198Z context = 2025-05-07T20:33:21.1083512Z 2025-05-07T20:33:21.1083685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.1084238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.1084735Z module_map=module_map) 2025-05-07T20:33:21.1085112Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.1085476Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:21.1085758Z E ^ 2025-05-07T20:33:21.1086244Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.1086741Z 2025-05-07T20:33:21.1087267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.1087828Z 2025-05-07T20:33:21.1087929Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.1088364Z self=, 2025-05-07T20:33:21.1088780Z T=4096, 2025-05-07T20:33:21.1088968Z D=5120, 2025-05-07T20:33:21.1089160Z scale_ub=None, 2025-05-07T20:33:21.1089367Z contiguous=False, 2025-05-07T20:33:21.1089599Z compiled=False, 2025-05-07T20:33:21.1089802Z ) 2025-05-07T20:33:22.8425841Z self = 2025-05-07T20:33:22.8426937Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.8427332Z 2025-05-07T20:33:22.8427419Z @given( 2025-05-07T20:33:22.8427686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.8428018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.8428342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.8428699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.8429043Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.8429355Z ) 2025-05-07T20:33:22.8429732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.8430673Z def test_silu_mul_quant( 2025-05-07T20:33:22.8430930Z self, 2025-05-07T20:33:22.8431216Z T: int, 2025-05-07T20:33:22.8431418Z D: int, 2025-05-07T20:33:22.8431635Z scale_ub: Optional[float], 2025-05-07T20:33:22.8431912Z contiguous: bool, 2025-05-07T20:33:22.8432161Z compiled: bool, 2025-05-07T20:33:22.8432385Z ) -> None: 2025-05-07T20:33:22.8432600Z torch.manual_seed(2025) 2025-05-07T20:33:22.8432933Z 2025-05-07T20:33:22.8433204Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.8433561Z 2025-05-07T20:33:22.8433755Z x_sign = torch.sign(x) 2025-05-07T20:33:22.8434046Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.8434364Z x = x_sign * x_clamp 2025-05-07T20:33:22.8434607Z x0 = x[:, :D] 2025-05-07T20:33:22.8434816Z x1 = x[:, D:] 2025-05-07T20:33:22.8435025Z 2025-05-07T20:33:22.8435211Z if contiguous: 2025-05-07T20:33:22.8435447Z x0 = x0.contiguous() 2025-05-07T20:33:22.8435705Z x1 = x1.contiguous() 2025-05-07T20:33:22.8435951Z 2025-05-07T20:33:22.8436144Z if scale_ub is not None: 2025-05-07T20:33:22.8436415Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.8436758Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.8437078Z ) 2025-05-07T20:33:22.8437266Z else: 2025-05-07T20:33:22.8437484Z scale_ub_tensor = None 2025-05-07T20:33:22.8437741Z 2025-05-07T20:33:22.8437966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8438292Z op = silu_mul_quant 2025-05-07T20:33:22.8438549Z if compiled: 
2025-05-07T20:33:22.8438795Z op = torch.compile(op) 2025-05-07T20:33:22.8439100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8439387Z 2025-05-07T20:33:22.8439570Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.8439744Z 2025-05-07T20:33:22.8439845Z moe/activation_test.py:117: 2025-05-07T20:33:22.8440152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8440502Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.8440782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8441528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.8442286Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.8442848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.8443674Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.8444383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.8444948Z kernel = self.compile( 2025-05-07T20:33:22.8445513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.8446215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.8446627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8446867Z 2025-05-07T20:33:22.8447085Z self = 2025-05-07T20:33:22.8448245Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.8449766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ffeb8430>} 2025-05-07T20:33:22.8451272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.8452410Z context = 2025-05-07T20:33:22.8452713Z 2025-05-07T20:33:22.8452879Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.8453425Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.8453960Z module_map=module_map) 2025-05-07T20:33:22.8454337Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.8454694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.8454957Z E ^ 2025-05-07T20:33:22.8461235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.8461778Z 2025-05-07T20:33:22.8462255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.8462835Z 2025-05-07T20:33:22.8462952Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.8463399Z self=, 2025-05-07T20:33:22.8463842Z T=4096, 2025-05-07T20:33:22.8464041Z D=7168, 2025-05-07T20:33:22.8464253Z scale_ub=None, 2025-05-07T20:33:22.8464489Z contiguous=False, 2025-05-07T20:33:22.8464735Z compiled=False, 2025-05-07T20:33:22.8464958Z ) 2025-05-07T20:33:22.8465299Z self = 2025-05-07T20:33:22.8465828Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.8466129Z 2025-05-07T20:33:22.8466211Z @given( 2025-05-07T20:33:22.8466455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.8466781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.8467107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.8467461Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.8467813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.8468113Z ) 2025-05-07T20:33:22.8468487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.8468966Z def test_silu_mul_quant( 2025-05-07T20:33:22.8469216Z self, 2025-05-07T20:33:22.8469423Z T: int, 2025-05-07T20:33:22.8469638Z D: int, 2025-05-07T20:33:22.8470033Z scale_ub: Optional[float], 2025-05-07T20:33:22.8470322Z contiguous: bool, 2025-05-07T20:33:22.8470579Z compiled: bool, 2025-05-07T20:33:22.8470887Z ) -> None: 2025-05-07T20:33:22.8471123Z torch.manual_seed(2025) 2025-05-07T20:33:22.8471381Z 2025-05-07T20:33:22.8471661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.8472030Z 2025-05-07T20:33:22.8472238Z x_sign = torch.sign(x) 2025-05-07T20:33:22.8472543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.8472878Z x = x_sign * x_clamp 2025-05-07T20:33:22.8473129Z x0 = x[:, :D] 2025-05-07T20:33:22.8473362Z x1 = x[:, D:] 2025-05-07T20:33:22.8473570Z 2025-05-07T20:33:22.8473766Z if contiguous: 2025-05-07T20:33:22.8474004Z x0 = x0.contiguous() 2025-05-07T20:33:22.8474273Z x1 = x1.contiguous() 2025-05-07T20:33:22.8474520Z 2025-05-07T20:33:22.8474718Z if scale_ub is not None: 2025-05-07T20:33:22.8475001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.8475345Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.8475671Z ) 2025-05-07T20:33:22.8475865Z else: 2025-05-07T20:33:22.8476068Z scale_ub_tensor = None 2025-05-07T20:33:22.8476349Z 2025-05-07T20:33:22.8476616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8476938Z op = silu_mul_quant 2025-05-07T20:33:22.8477249Z if compiled: 2025-05-07T20:33:22.8477505Z op = torch.compile(op) 2025-05-07T20:33:22.8477850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8478141Z 2025-05-07T20:33:22.8478336Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.8478504Z 2025-05-07T20:33:22.8478609Z moe/activation_test.py:117: 2025-05-07T20:33:22.8478908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8479307Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.8479598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8480341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.8481090Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.8481671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.8482410Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.8483578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.8484151Z kernel = self.compile( 2025-05-07T20:33:22.8484724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.8485420Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.8485840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8486090Z 2025-05-07T20:33:22.8486304Z self = 2025-05-07T20:33:22.8487476Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.8488981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ffbdedc0>} 2025-05-07T20:33:22.8490457Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.8491570Z context = 2025-05-07T20:33:22.8491874Z 2025-05-07T20:33:22.8492052Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.8492694Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.8493187Z module_map=module_map) 2025-05-07T20:33:22.8493568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.8493934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.8494201Z E ^ 2025-05-07T20:33:22.8494697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.8495189Z 2025-05-07T20:33:22.8495644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.8496200Z 2025-05-07T20:33:22.8496310Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.8496737Z self=, 2025-05-07T20:33:22.8497167Z T=128, 2025-05-07T20:33:22.8497355Z D=7168, 2025-05-07T20:33:22.8497548Z scale_ub=None, 2025-05-07T20:33:22.8497771Z contiguous=False, 2025-05-07T20:33:22.8498002Z compiled=True, 2025-05-07T20:33:22.8498199Z ) 2025-05-07T20:33:22.9256776Z self = 2025-05-07T20:33:22.9257694Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.9258063Z 2025-05-07T20:33:22.9258144Z @given( 2025-05-07T20:33:22.9258473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.9258794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.9259111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.9259446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.9259782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.9260154Z ) 2025-05-07T20:33:22.9260511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.9260978Z def test_silu_mul_quant( 2025-05-07T20:33:22.9261222Z self, 2025-05-07T20:33:22.9261408Z T: int, 2025-05-07T20:33:22.9261604Z D: int, 2025-05-07T20:33:22.9261819Z scale_ub: Optional[float], 2025-05-07T20:33:22.9262093Z contiguous: bool, 2025-05-07T20:33:22.9262335Z compiled: bool, 2025-05-07T20:33:22.9262569Z ) -> None: 2025-05-07T20:33:22.9262780Z torch.manual_seed(2025) 2025-05-07T20:33:22.9263026Z 2025-05-07T20:33:22.9263301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.9263655Z 2025-05-07T20:33:22.9263842Z x_sign = torch.sign(x) 2025-05-07T20:33:22.9264136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.9264451Z x = x_sign * x_clamp 2025-05-07T20:33:22.9264689Z x0 = x[:, :D] 2025-05-07T20:33:22.9264906Z x1 = x[:, D:] 2025-05-07T20:33:22.9265119Z 2025-05-07T20:33:22.9265297Z if contiguous: 2025-05-07T20:33:22.9265531Z x0 = x0.contiguous() 2025-05-07T20:33:22.9265798Z x1 = x1.contiguous() 2025-05-07T20:33:22.9266036Z 2025-05-07T20:33:22.9266229Z if scale_ub is not None: 2025-05-07T20:33:22.9266509Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.9266848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.9267169Z ) 2025-05-07T20:33:22.9267364Z else: 2025-05-07T20:33:22.9267574Z scale_ub_tensor = None 2025-05-07T20:33:22.9267838Z 2025-05-07T20:33:22.9268076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.9268401Z op = silu_mul_quant 2025-05-07T20:33:22.9268650Z if compiled: 2025-05-07T20:33:22.9268902Z op = torch.compile(op) 2025-05-07T20:33:22.9269210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.9269490Z 2025-05-07T20:33:22.9269684Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.9270235Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.9270533Z 2025-05-07T20:33:22.9270771Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.9271122Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.9271416Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.9271739Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.9272112Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.9272432Z 2025-05-07T20:33:22.9272628Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.9272835Z 2025-05-07T20:33:22.9272931Z moe/activation_test.py:126: 2025-05-07T20:33:22.9273238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.9273585Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.9273924Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.9274779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.9275588Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.9276166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.9276954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.9277732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.9278500Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.9279316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:22.9280187Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.9280973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.9281655Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.9282302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.9283118Z fn() 2025-05-07T20:33:22.9283656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.9284283Z self.fn.run( 2025-05-07T20:33:22.9284771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.9285335Z kernel = self.compile( 2025-05-07T20:33:22.9285903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.9286606Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.9287023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.9287266Z 2025-05-07T20:33:22.9287485Z self = 2025-05-07T20:33:22.9288650Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.9290169Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58ff752160>} 2025-05-07T20:33:22.9291638Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.9292756Z context = 2025-05-07T20:33:22.9293061Z 2025-05-07T20:33:22.9293305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.9293859Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.9294353Z module_map=module_map) 2025-05-07T20:33:22.9294734Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.9295099Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.9295379Z E ^ 2025-05-07T20:33:22.9295871Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.9296360Z 2025-05-07T20:33:22.9296808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.9297374Z 2025-05-07T20:33:22.9297475Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.9297906Z self=, 2025-05-07T20:33:22.9298337Z T=128, 2025-05-07T20:33:22.9298523Z D=7168, 2025-05-07T20:33:22.9298715Z scale_ub=None, 2025-05-07T20:33:22.9298929Z contiguous=False, 2025-05-07T20:33:22.9299162Z compiled=False, 2025-05-07T20:33:22.9299373Z ) 2025-05-07T20:33:23.3273497Z self = 2025-05-07T20:33:23.3274135Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:23.3274545Z 2025-05-07T20:33:23.3274627Z @given( 2025-05-07T20:33:23.3274860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.3275187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.3275503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.3275932Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.3276271Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.3276564Z ) 2025-05-07T20:33:23.3276934Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.3277398Z def test_silu_mul_quant( 2025-05-07T20:33:23.3277638Z self, 2025-05-07T20:33:23.3277824Z T: int, 2025-05-07T20:33:23.3278019Z D: int, 2025-05-07T20:33:23.3278232Z scale_ub: Optional[float], 2025-05-07T20:33:23.3278501Z contiguous: bool, 2025-05-07T20:33:23.3278741Z compiled: bool, 2025-05-07T20:33:23.3278970Z ) -> None: 2025-05-07T20:33:23.3279178Z torch.manual_seed(2025) 2025-05-07T20:33:23.3279424Z 2025-05-07T20:33:23.3279697Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.3280049Z 2025-05-07T20:33:23.3280274Z x_sign = torch.sign(x) 2025-05-07T20:33:23.3280561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.3280888Z x = x_sign * x_clamp 2025-05-07T20:33:23.3281131Z x0 = x[:, :D] 2025-05-07T20:33:23.3281343Z x1 = x[:, D:] 2025-05-07T20:33:23.3281553Z 2025-05-07T20:33:23.3281736Z if contiguous: 2025-05-07T20:33:23.3281962Z x0 = x0.contiguous() 2025-05-07T20:33:23.3282222Z x1 = x1.contiguous() 2025-05-07T20:33:23.3282465Z 2025-05-07T20:33:23.3282655Z if scale_ub is not None: 2025-05-07T20:33:23.3283231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.3283575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.3283898Z ) 2025-05-07T20:33:23.3284079Z else: 2025-05-07T20:33:23.3284288Z scale_ub_tensor = None 2025-05-07T20:33:23.3284545Z 2025-05-07T20:33:23.3284769Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.3285101Z op = silu_mul_quant 2025-05-07T20:33:23.3285363Z if compiled: 
2025-05-07T20:33:23.3285605Z op = torch.compile(op) 2025-05-07T20:33:23.3285909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.3286194Z 2025-05-07T20:33:23.3286471Z > y_fp8, y_scale = fn() 2025-05-07T20:33:23.3286650Z 2025-05-07T20:33:23.3286749Z moe/activation_test.py:117: 2025-05-07T20:33:23.3287052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.3287398Z moe/activation_test.py:115: in fn 2025-05-07T20:33:23.3287683Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.3288421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:23.3289165Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:23.3289726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:23.3290457Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:23.3291169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:23.3291738Z kernel = self.compile( 2025-05-07T20:33:23.3292303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:23.3293000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:23.3293484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.3293729Z 2025-05-07T20:33:23.3294003Z self = 2025-05-07T20:33:23.3295167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:23.3296820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ff6dd940>} 2025-05-07T20:33:23.3298282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:23.3299387Z context = 2025-05-07T20:33:23.3299691Z 2025-05-07T20:33:23.3299866Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:23.3300411Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:23.3300905Z module_map=module_map) 2025-05-07T20:33:23.3301283Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:23.3301637Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:23.3301903Z E ^ 2025-05-07T20:33:23.3302391Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.3302877Z 2025-05-07T20:33:23.3303330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.3303886Z 2025-05-07T20:33:23.3303988Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.3304419Z self=, 2025-05-07T20:33:23.3304840Z T=4096, 2025-05-07T20:33:23.3305026Z D=5120, 2025-05-07T20:33:23.3305222Z scale_ub=1200.0, 2025-05-07T20:33:23.3305447Z contiguous=True, 2025-05-07T20:33:23.3305666Z compiled=False, 2025-05-07T20:33:23.3305867Z ) 2025-05-07T20:33:23.3306191Z self = 2025-05-07T20:33:23.3306712Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:23.3307005Z 2025-05-07T20:33:23.3307081Z @given( 2025-05-07T20:33:23.3307311Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.3307684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.3308025Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.3308398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.3308776Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.3309094Z ) 2025-05-07T20:33:23.3309498Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.3310131Z def test_silu_mul_quant( 2025-05-07T20:33:23.3310378Z self, 2025-05-07T20:33:23.3310568Z T: int, 2025-05-07T20:33:23.3310762Z D: int, 2025-05-07T20:33:23.3310984Z scale_ub: Optional[float], 2025-05-07T20:33:23.3311252Z contiguous: bool, 2025-05-07T20:33:23.3311491Z compiled: bool, 2025-05-07T20:33:23.3311720Z ) -> None: 2025-05-07T20:33:23.3311932Z torch.manual_seed(2025) 2025-05-07T20:33:23.3312185Z 2025-05-07T20:33:23.3312457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.3312810Z 2025-05-07T20:33:23.3313008Z x_sign = torch.sign(x) 2025-05-07T20:33:23.3313304Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.3313619Z x = x_sign * x_clamp 2025-05-07T20:33:23.3313861Z x0 = x[:, :D] 2025-05-07T20:33:23.3314077Z x1 = x[:, D:] 2025-05-07T20:33:23.3314327Z 2025-05-07T20:33:23.3314510Z if contiguous: 2025-05-07T20:33:23.3314779Z x0 = x0.contiguous() 2025-05-07T20:33:23.3315041Z x1 = x1.contiguous() 2025-05-07T20:33:23.3315282Z 2025-05-07T20:33:23.3315471Z if scale_ub is not None: 2025-05-07T20:33:23.3315747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.3316081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.3316437Z ) 2025-05-07T20:33:23.3316630Z else: 2025-05-07T20:33:23.3316831Z scale_ub_tensor = None 2025-05-07T20:33:23.3317087Z 2025-05-07T20:33:23.3317317Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.3317633Z op = silu_mul_quant 2025-05-07T20:33:23.3317889Z if compiled: 2025-05-07T20:33:23.3318138Z op = torch.compile(op) 2025-05-07T20:33:23.3318431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.3318713Z 2025-05-07T20:33:23.3318908Z > y_fp8, y_scale = fn() 2025-05-07T20:33:23.3319075Z 2025-05-07T20:33:23.3319180Z moe/activation_test.py:117: 2025-05-07T20:33:23.3319474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.3319820Z moe/activation_test.py:115: in fn 2025-05-07T20:33:23.3320106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.3320831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:23.3321579Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:23.3322150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:23.3322881Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:23.3323590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:23.3324162Z kernel = self.compile( 2025-05-07T20:33:23.3324738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:23.3325436Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:23.3325850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.3326097Z 2025-05-07T20:33:23.3326308Z self = 2025-05-07T20:33:23.3327527Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:23.3329025Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ff73a8b0>} 2025-05-07T20:33:23.3330492Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:23.3331601Z context = 2025-05-07T20:33:23.3331905Z 2025-05-07T20:33:23.3332081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:23.3332633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:23.3333120Z module_map=module_map) 2025-05-07T20:33:23.3333503Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:23.3333868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:23.3334128Z E ^ 2025-05-07T20:33:23.3334617Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.3335102Z 2025-05-07T20:33:23.3335599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.3336189Z 2025-05-07T20:33:23.3336294Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.3336718Z self=, 2025-05-07T20:33:23.3337144Z T=1, 2025-05-07T20:33:23.3337322Z D=5120, 2025-05-07T20:33:23.3337581Z scale_ub=None, 2025-05-07T20:33:23.3337793Z contiguous=True, 2025-05-07T20:33:23.3338013Z compiled=True, 2025-05-07T20:33:23.3338208Z ) 2025-05-07T20:33:23.9874448Z self = 2025-05-07T20:33:23.9875167Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:23.9875445Z 2025-05-07T20:33:23.9875538Z @given( 2025-05-07T20:33:23.9875769Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.9876101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.9876428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.9876774Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.9877117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.9877416Z ) 2025-05-07T20:33:23.9877782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.9878247Z def test_silu_mul_quant( 2025-05-07T20:33:23.9878499Z self, 2025-05-07T20:33:23.9878693Z T: int, 2025-05-07T20:33:23.9878884Z D: int, 2025-05-07T20:33:23.9885519Z scale_ub: Optional[float], 2025-05-07T20:33:23.9885863Z contiguous: bool, 2025-05-07T20:33:23.9886127Z compiled: bool, 2025-05-07T20:33:23.9886367Z ) -> None: 2025-05-07T20:33:23.9886600Z torch.manual_seed(2025) 2025-05-07T20:33:23.9886898Z 2025-05-07T20:33:23.9887186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.9887554Z 2025-05-07T20:33:23.9887762Z x_sign = torch.sign(x) 2025-05-07T20:33:23.9888059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.9888387Z x = x_sign * x_clamp 2025-05-07T20:33:23.9888640Z x0 = x[:, :D] 2025-05-07T20:33:23.9888856Z x1 = x[:, D:] 2025-05-07T20:33:23.9889077Z 2025-05-07T20:33:23.9889276Z if contiguous: 2025-05-07T20:33:23.9889508Z x0 = x0.contiguous() 2025-05-07T20:33:23.9889785Z x1 = x1.contiguous() 2025-05-07T20:33:23.9890040Z 2025-05-07T20:33:23.9890232Z if scale_ub is not None: 2025-05-07T20:33:23.9890517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.9891175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.9891515Z ) 2025-05-07T20:33:23.9891714Z else: 2025-05-07T20:33:23.9891934Z scale_ub_tensor = None 2025-05-07T20:33:23.9892200Z 2025-05-07T20:33:23.9892437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.9892778Z op = silu_mul_quant 2025-05-07T20:33:23.9893051Z if compiled: 2025-05-07T20:33:23.9893299Z op = torch.compile(op) 2025-05-07T20:33:23.9893611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.9893901Z 2025-05-07T20:33:23.9894095Z y_fp8, y_scale = fn() 2025-05-07T20:33:23.9894392Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:23.9894703Z 2025-05-07T20:33:23.9894947Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.9895296Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:23.9895607Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:23.9895930Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:23.9896308Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:23.9896635Z 2025-05-07T20:33:23.9896834Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:23.9897130Z 2025-05-07T20:33:23.9897239Z moe/activation_test.py:126: 2025-05-07T20:33:23.9897619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.9897974Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:23.9898309Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:23.9899153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:23.9900048Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:23.9900629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:23.9901361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:23.9902090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:23.9902867Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:23.9903675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:23.9904477Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:23.9905253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:23.9905936Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:23.9906577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:23.9907124Z fn() 2025-05-07T20:33:23.9907658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:23.9908279Z self.fn.run( 2025-05-07T20:33:23.9908771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:23.9909334Z kernel = self.compile( 2025-05-07T20:33:23.9910070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:23.9910767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:23.9911175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.9911426Z 2025-05-07T20:33:23.9911638Z self = 2025-05-07T20:33:23.9912860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:23.9914381Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58ff44e550>} 2025-05-07T20:33:23.9915850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:23.9916990Z context = 2025-05-07T20:33:23.9917318Z 2025-05-07T20:33:23.9917487Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:23.9918034Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:23.9918528Z module_map=module_map) 2025-05-07T20:33:23.9918897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:23.9919261Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:23.9919535Z E ^ 2025-05-07T20:33:23.9920065Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.9920603Z 2025-05-07T20:33:23.9921051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.9921614Z 2025-05-07T20:33:23.9921716Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.9922146Z self=, 2025-05-07T20:33:23.9922601Z T=2048, 2025-05-07T20:33:23.9922785Z D=5120, 2025-05-07T20:33:23.9922976Z scale_ub=None, 2025-05-07T20:33:23.9923186Z contiguous=True, 2025-05-07T20:33:23.9923412Z compiled=True, 2025-05-07T20:33:23.9923624Z ) 2025-05-07T20:33:24.6038460Z self = 2025-05-07T20:33:24.6039286Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:24.6039692Z 2025-05-07T20:33:24.6039807Z @given( 2025-05-07T20:33:24.6040155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.6040574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.6040992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.6041430Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.6041799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.6042106Z ) 2025-05-07T20:33:24.6042479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.6042955Z def test_silu_mul_quant( 2025-05-07T20:33:24.6043204Z self, 2025-05-07T20:33:24.6043410Z T: int, 2025-05-07T20:33:24.6043608Z D: int, 2025-05-07T20:33:24.6043838Z scale_ub: Optional[float], 2025-05-07T20:33:24.6044118Z contiguous: bool, 2025-05-07T20:33:24.6044361Z compiled: bool, 2025-05-07T20:33:24.6044602Z ) -> None: 2025-05-07T20:33:24.6044830Z torch.manual_seed(2025) 2025-05-07T20:33:24.6045084Z 2025-05-07T20:33:24.6045366Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.6045741Z 2025-05-07T20:33:24.6045946Z x_sign = torch.sign(x) 2025-05-07T20:33:24.6046244Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.6046578Z x = x_sign * x_clamp 2025-05-07T20:33:24.6046833Z x0 = x[:, :D] 2025-05-07T20:33:24.6047053Z x1 = x[:, D:] 2025-05-07T20:33:24.6047284Z 2025-05-07T20:33:24.6047481Z if contiguous: 2025-05-07T20:33:24.6047713Z x0 = x0.contiguous() 2025-05-07T20:33:24.6047987Z x1 = x1.contiguous() 2025-05-07T20:33:24.6048240Z 2025-05-07T20:33:24.6048721Z if scale_ub is not None: 2025-05-07T20:33:24.6049013Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.6049364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.6049688Z ) 2025-05-07T20:33:24.6049890Z else: 2025-05-07T20:33:24.6050107Z scale_ub_tensor = None 2025-05-07T20:33:24.6050380Z 2025-05-07T20:33:24.6050615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.6050950Z op = silu_mul_quant 2025-05-07T20:33:24.6051212Z if compiled: 
2025-05-07T20:33:24.6051464Z op = torch.compile(op) 2025-05-07T20:33:24.6051776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6052067Z 2025-05-07T20:33:24.6052266Z y_fp8, y_scale = fn() 2025-05-07T20:33:24.6052560Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:24.6052869Z 2025-05-07T20:33:24.6053107Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.6053465Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:24.6053779Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:24.6054104Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:24.6054485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.6054900Z 2025-05-07T20:33:24.6055119Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:24.6055430Z 2025-05-07T20:33:24.6055531Z moe/activation_test.py:126: 2025-05-07T20:33:24.6055837Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6056187Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:24.6056520Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.6057465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:24.6058297Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:24.6058883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.6059620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.6060370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:24.6061167Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:24.6061985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:24.6062795Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:24.6063591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:24.6064285Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:24.6064925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:24.6065488Z fn() 2025-05-07T20:33:24.6066031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:24.6066665Z self.fn.run( 2025-05-07T20:33:24.6067207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.6067786Z kernel = self.compile( 2025-05-07T20:33:24.6068365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.6069071Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.6069495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6069927Z 2025-05-07T20:33:24.6070198Z self = 2025-05-07T20:33:24.6071397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:24.6072958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fefa7f70>} 2025-05-07T20:33:24.6074450Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.6075573Z context = 2025-05-07T20:33:24.6075888Z 2025-05-07T20:33:24.6076058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.6076620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.6077110Z module_map=module_map) 2025-05-07T20:33:24.6077495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.6077868Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:24.6078226Z E ^ 2025-05-07T20:33:24.6078724Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.6079269Z 2025-05-07T20:33:24.6079725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.6080291Z 2025-05-07T20:33:24.6080401Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.6080869Z self=, 2025-05-07T20:33:24.6081296Z T=128, 2025-05-07T20:33:24.6081487Z D=5120, 2025-05-07T20:33:24.6081676Z scale_ub=None, 2025-05-07T20:33:24.6081893Z contiguous=True, 2025-05-07T20:33:24.6082120Z compiled=True, 2025-05-07T20:33:24.6082320Z ) 2025-05-07T20:33:25.5951103Z self = 2025-05-07T20:33:25.5951880Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.5952298Z 2025-05-07T20:33:25.5952421Z @given( 2025-05-07T20:33:25.5952741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.5953082Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.5953413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.5953762Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.5954118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.5954436Z ) 2025-05-07T20:33:25.5954820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.5955307Z def test_silu_mul_quant( 2025-05-07T20:33:25.5955577Z self, 2025-05-07T20:33:25.5955784Z T: int, 2025-05-07T20:33:25.5955988Z D: int, 2025-05-07T20:33:25.5956222Z scale_ub: Optional[float], 2025-05-07T20:33:25.5956514Z contiguous: bool, 2025-05-07T20:33:25.5956764Z compiled: bool, 2025-05-07T20:33:25.5957009Z ) -> None: 2025-05-07T20:33:25.5957263Z torch.manual_seed(2025) 2025-05-07T20:33:25.5957538Z 2025-05-07T20:33:25.5957825Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.5958197Z 2025-05-07T20:33:25.5958423Z x_sign = torch.sign(x) 2025-05-07T20:33:25.5958724Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.5959056Z x = x_sign * x_clamp 2025-05-07T20:33:25.5959311Z x0 = x[:, :D] 2025-05-07T20:33:25.5959531Z x1 = x[:, D:] 2025-05-07T20:33:25.5959759Z 2025-05-07T20:33:25.5959955Z if contiguous: 2025-05-07T20:33:25.5960190Z x0 = x0.contiguous() 2025-05-07T20:33:25.5960794Z x1 = x1.contiguous() 2025-05-07T20:33:25.5961051Z 2025-05-07T20:33:25.5961255Z if scale_ub is not None: 2025-05-07T20:33:25.5961537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.5961889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.5962209Z ) 2025-05-07T20:33:25.5962397Z else: 2025-05-07T20:33:25.5962606Z scale_ub_tensor = None 2025-05-07T20:33:25.5962872Z 2025-05-07T20:33:25.5963096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:33:25.5963424Z op = silu_mul_quant 2025-05-07T20:33:25.5963676Z if compiled: 2025-05-07T20:33:25.5963920Z op = torch.compile(op) 2025-05-07T20:33:25.5964221Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.5964511Z 2025-05-07T20:33:25.5964696Z y_fp8, y_scale = fn() 2025-05-07T20:33:25.5964983Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:25.5965287Z 2025-05-07T20:33:25.5965520Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.5965868Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:25.5966180Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:25.5966507Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:25.5966971Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.5967399Z 2025-05-07T20:33:25.5967609Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:25.5967812Z 2025-05-07T20:33:25.5967911Z moe/activation_test.py:126: 2025-05-07T20:33:25.5968214Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.5968562Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:25.5968983Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.5969838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:25.5970659Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:25.5971240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.5971971Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.5972711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:25.5973493Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.5974303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:25.5975108Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.5975895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:25.5976581Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:25.5977225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:25.5977776Z fn() 2025-05-07T20:33:25.5978317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:25.5978942Z self.fn.run( 2025-05-07T20:33:25.5979428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.5979992Z kernel = self.compile( 2025-05-07T20:33:25.5980570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.5981274Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.5981687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.5981991Z 2025-05-07T20:33:25.5982205Z self = 2025-05-07T20:33:25.5983716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.5985238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ff2d2b80>} 2025-05-07T20:33:25.5986704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.5987816Z context = 2025-05-07T20:33:25.5988126Z 2025-05-07T20:33:25.5988297Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.5988843Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.5989326Z module_map=module_map) 2025-05-07T20:33:25.5989706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.5990245Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:25.5990597Z E ^ 2025-05-07T20:33:25.5991146Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.5991712Z 2025-05-07T20:33:25.5992222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.5992919Z 2025-05-07T20:33:25.5993033Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.5993502Z self=, 2025-05-07T20:33:25.5993973Z T=4096, 2025-05-07T20:33:25.5994170Z D=5120, 2025-05-07T20:33:25.5994366Z scale_ub=None, 2025-05-07T20:33:25.5994594Z contiguous=True, 2025-05-07T20:33:25.5994832Z compiled=True, 2025-05-07T20:33:25.5995049Z ) 2025-05-07T20:33:26.4281572Z self = 2025-05-07T20:33:26.4282357Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:26.4282700Z 2025-05-07T20:33:26.4283021Z @given( 2025-05-07T20:33:26.4283267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.4283586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.4283902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.4284245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.4284583Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.4284881Z ) 2025-05-07T20:33:26.4285253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.4285711Z def test_silu_mul_quant( 2025-05-07T20:33:26.4285955Z self, 2025-05-07T20:33:26.4286152Z T: int, 2025-05-07T20:33:26.4286347Z D: int, 2025-05-07T20:33:26.4286573Z scale_ub: Optional[float], 2025-05-07T20:33:26.4286847Z contiguous: bool, 2025-05-07T20:33:26.4287102Z compiled: bool, 2025-05-07T20:33:26.4287326Z ) -> None: 2025-05-07T20:33:26.4287547Z torch.manual_seed(2025) 2025-05-07T20:33:26.4287796Z 2025-05-07T20:33:26.4288066Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.4288428Z 2025-05-07T20:33:26.4288625Z x_sign = torch.sign(x) 2025-05-07T20:33:26.4288914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.4289238Z x = x_sign * x_clamp 2025-05-07T20:33:26.4289482Z x0 = x[:, :D] 2025-05-07T20:33:26.4289695Z x1 = x[:, D:] 2025-05-07T20:33:26.4289903Z 2025-05-07T20:33:26.4290416Z if contiguous: 2025-05-07T20:33:26.4290646Z x0 = x0.contiguous() 2025-05-07T20:33:26.4290905Z x1 = x1.contiguous() 2025-05-07T20:33:26.4291150Z 2025-05-07T20:33:26.4291335Z if scale_ub is not None: 2025-05-07T20:33:26.4291611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.4291952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.4292268Z ) 2025-05-07T20:33:26.4292453Z else: 2025-05-07T20:33:26.4292660Z scale_ub_tensor 
= None 2025-05-07T20:33:26.4292917Z 2025-05-07T20:33:26.4293144Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.4293468Z op = silu_mul_quant 2025-05-07T20:33:26.4293721Z if compiled: 2025-05-07T20:33:26.4293962Z op = torch.compile(op) 2025-05-07T20:33:26.4294264Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.4294548Z 2025-05-07T20:33:26.4294733Z y_fp8, y_scale = fn() 2025-05-07T20:33:26.4295026Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:26.4295325Z 2025-05-07T20:33:26.4295556Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.4295899Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:26.4296295Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:26.4296614Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:26.4297076Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.4297446Z 2025-05-07T20:33:26.4297651Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:26.4297850Z 2025-05-07T20:33:26.4297946Z moe/activation_test.py:126: 2025-05-07T20:33:26.4298245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.4298668Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:26.4298997Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.4299849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:26.4300665Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:26.4301251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.4301976Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.4302718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:26.4303494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.4304302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:26.4305101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.4305885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:26.4306568Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:26.4307201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:26.4307757Z fn() 2025-05-07T20:33:26.4308294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:26.4308918Z self.fn.run( 2025-05-07T20:33:26.4309401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.4310121Z kernel = self.compile( 2025-05-07T20:33:26.4310697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.4311387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.4312376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.4312631Z 2025-05-07T20:33:26.4312847Z self = 2025-05-07T20:33:26.4314016Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.4315549Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58feddf5e0>} 2025-05-07T20:33:26.4317010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.4318139Z context = 2025-05-07T20:33:26.4324448Z 2025-05-07T20:33:26.4324641Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.4325214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.4325803Z module_map=module_map) 2025-05-07T20:33:26.4326185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.4326609Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:26.4326895Z E ^ 2025-05-07T20:33:26.4327386Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.4327890Z 2025-05-07T20:33:26.4328343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.4328958Z 2025-05-07T20:33:26.4329061Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.4329493Z self=, 2025-05-07T20:33:26.4329910Z T=16384, 2025-05-07T20:33:26.4330109Z D=5120, 2025-05-07T20:33:26.4330311Z scale_ub=None, 2025-05-07T20:33:26.4330521Z contiguous=True, 2025-05-07T20:33:26.4330752Z compiled=True, 2025-05-07T20:33:26.4330970Z ) 2025-05-07T20:33:26.4752350Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:26.4753753Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:26.4755245Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:26.4756341Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:26.4757557Z W0507 20:33:26.473680 88490 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
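[Editor's note, not part of the captured log] The W0507 records above show torch._dynamo giving up on 'silu_mul_quant' after hitting config.recompile_limit (8): each Hypothesis example changes T, and the contiguous flag changes strides, so every new shape/stride combination triggers a fresh compile until the limit is reached and dynamo falls back to eager. A minimal sketch of the two mitigations the warning itself points at; the recompile_limit knob name is taken from the warning, and mark_dynamic / dynamic=True are standard torch.compile APIs, but wiring them into this test is the editor's assumption, not something the log shows:

    import torch
    import torch._dynamo

    # Option 1: raise the cap so all sampled shapes fit (the default is 8,
    # per the warning above).
    torch._dynamo.config.recompile_limit = 64

    # Option 2: compile one shape-generic graph instead of one per (T, stride).
    op = torch.compile(silu_mul_quant, dynamic=True)
    torch._dynamo.mark_dynamic(x0, 0)  # token dimension T varies per example
    torch._dynamo.mark_dynamic(x1, 0)
    y_fp8, y_scale = op(x0, x1, scale_ub_tensor)

As the warning notes, rerunning with TORCH_LOGS="recompiles" prints every failed guard; the one shown here is the stride mismatch on x0 between the contiguous and sliced input variants.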
2025-05-07T20:33:26.5957710Z self = 2025-05-07T20:33:26.5958520Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:26.5958931Z 2025-05-07T20:33:26.5959045Z @given( 2025-05-07T20:33:26.5959305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.5959637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.5959959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.5960309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.5960655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.5961242Z ) 2025-05-07T20:33:26.5961613Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.5962091Z def test_silu_mul_quant( 2025-05-07T20:33:26.5962339Z self, 2025-05-07T20:33:26.5962543Z T: int, 2025-05-07T20:33:26.5962745Z D: int, 2025-05-07T20:33:26.5962970Z scale_ub: Optional[float], 2025-05-07T20:33:26.5963261Z contiguous: bool, 2025-05-07T20:33:26.5963513Z compiled: bool, 2025-05-07T20:33:26.5963750Z ) -> None: 2025-05-07T20:33:26.5963965Z torch.manual_seed(2025) 2025-05-07T20:33:26.5964223Z 2025-05-07T20:33:26.5964507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.5964863Z 2025-05-07T20:33:26.5965060Z x_sign = torch.sign(x) 2025-05-07T20:33:26.5965358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.5965680Z x = x_sign * x_clamp 2025-05-07T20:33:26.5965923Z x0 = x[:, :D] 2025-05-07T20:33:26.5966149Z x1 = x[:, D:] 2025-05-07T20:33:26.5966363Z 2025-05-07T20:33:26.5966547Z if contiguous: 2025-05-07T20:33:26.5966778Z x0 = x0.contiguous() 2025-05-07T20:33:26.5967041Z x1 = x1.contiguous() 2025-05-07T20:33:26.5967285Z 2025-05-07T20:33:26.5967477Z if scale_ub is not None: 2025-05-07T20:33:26.5967847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.5968262Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.5968583Z ) 2025-05-07T20:33:26.5968775Z else: 2025-05-07T20:33:26.5968980Z scale_ub_tensor = None 2025-05-07T20:33:26.5969237Z 2025-05-07T20:33:26.5969474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.5969884Z op = silu_mul_quant 2025-05-07T20:33:26.5970133Z if compiled: 2025-05-07T20:33:26.5970384Z op = torch.compile(op) 2025-05-07T20:33:26.5970692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.5970976Z 2025-05-07T20:33:26.5971168Z y_fp8, y_scale = fn() 2025-05-07T20:33:26.5971456Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:26.5971752Z 2025-05-07T20:33:26.5971994Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.5972343Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:26.5972640Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:26.5972966Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:26.5973339Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.5973660Z 2025-05-07T20:33:26.5973866Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:26.5974077Z 2025-05-07T20:33:26.5974182Z moe/activation_test.py:126: 2025-05-07T20:33:26.5974490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.5974830Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:26.5975170Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.5976025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:26.5976840Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:26.5977424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.5978164Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.5978908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:26.5979674Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.5980486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:26.5981341Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.5982128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:26.5983097Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:26.5983741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:26.5984294Z fn() 2025-05-07T20:33:26.5984823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:26.5985445Z self.fn.run( 2025-05-07T20:33:26.5985933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.5986499Z kernel = self.compile( 2025-05-07T20:33:26.5987064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.5987814Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.5988229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.5988470Z 2025-05-07T20:33:26.5988694Z self = 2025-05-07T20:33:26.5990112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.5991698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ff41bee0>} 2025-05-07T20:33:26.5993227Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.5994334Z context = 2025-05-07T20:33:26.5994639Z 2025-05-07T20:33:26.5994809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.5995362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.5995860Z module_map=module_map) 2025-05-07T20:33:26.5996245Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.5996608Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:26.5996883Z E ^ 2025-05-07T20:33:26.5997377Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.5997870Z 2025-05-07T20:33:26.5998319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.5998882Z 2025-05-07T20:33:26.5998989Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.5999419Z self=, 2025-05-07T20:33:26.5999844Z T=1, 2025-05-07T20:33:26.6000025Z D=5120, 2025-05-07T20:33:26.6000230Z scale_ub=1200.0, 2025-05-07T20:33:26.6000457Z contiguous=True, 2025-05-07T20:33:26.6000676Z compiled=True, 2025-05-07T20:33:26.6000890Z ) 2025-05-07T20:33:26.7700997Z self = 2025-05-07T20:33:26.7701766Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:26.7702139Z 2025-05-07T20:33:26.7702244Z @given( 2025-05-07T20:33:26.7702535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.7702878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.7703193Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.7703524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.7704044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.7704341Z ) 2025-05-07T20:33:26.7704696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.7705163Z def test_silu_mul_quant( 2025-05-07T20:33:26.7705412Z self, 2025-05-07T20:33:26.7705607Z T: int, 2025-05-07T20:33:26.7705808Z D: int, 2025-05-07T20:33:26.7706036Z scale_ub: Optional[float], 2025-05-07T20:33:26.7706305Z contiguous: bool, 2025-05-07T20:33:26.7706548Z compiled: bool, 2025-05-07T20:33:26.7706773Z ) -> None: 2025-05-07T20:33:26.7706980Z torch.manual_seed(2025) 2025-05-07T20:33:26.7707230Z 2025-05-07T20:33:26.7707501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.7707857Z 2025-05-07T20:33:26.7708041Z x_sign = torch.sign(x) 2025-05-07T20:33:26.7708332Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.7708649Z x = x_sign * x_clamp 2025-05-07T20:33:26.7708893Z x0 = x[:, :D] 2025-05-07T20:33:26.7709110Z x1 = x[:, D:] 2025-05-07T20:33:26.7709320Z 2025-05-07T20:33:26.7709498Z if contiguous: 2025-05-07T20:33:26.7709732Z x0 = x0.contiguous() 2025-05-07T20:33:26.7710310Z x1 = x1.contiguous() 2025-05-07T20:33:26.7710550Z 2025-05-07T20:33:26.7710737Z if scale_ub is not None: 2025-05-07T20:33:26.7711092Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.7711431Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.7711749Z ) 2025-05-07T20:33:26.7711936Z else: 2025-05-07T20:33:26.7712142Z scale_ub_tensor = None 2025-05-07T20:33:26.7712508Z 2025-05-07T20:33:26.7712738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.7713062Z op = silu_mul_quant 2025-05-07T20:33:26.7713309Z if compiled: 2025-05-07T20:33:26.7713560Z op = torch.compile(op) 2025-05-07T20:33:26.7713862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.7714141Z 2025-05-07T20:33:26.7714333Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.7714499Z 2025-05-07T20:33:26.7714604Z moe/activation_test.py:117: 2025-05-07T20:33:26.7714906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.7715254Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.7715538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.7716119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:26.7716717Z return fn(*args, **kwargs) 
2025-05-07T20:33:26.7717423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.7718171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.7718733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.7719461Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.7720167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.7720738Z kernel = self.compile( 2025-05-07T20:33:26.7721308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.7722009Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.7722422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.7722665Z 2025-05-07T20:33:26.7722880Z self = 2025-05-07T20:33:26.7724104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.7725619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fedaf700>} 2025-05-07T20:33:26.7727093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.7728215Z context = 2025-05-07T20:33:26.7728523Z 2025-05-07T20:33:26.7728693Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.7729244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.7729738Z module_map=module_map) 2025-05-07T20:33:26.7730109Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.7730473Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.7730747Z E ^ 2025-05-07T20:33:26.7731230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.7731768Z 2025-05-07T20:33:26.7732219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.7732820Z 2025-05-07T20:33:26.7732921Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.7733350Z self=, 2025-05-07T20:33:26.7733774Z T=1, 2025-05-07T20:33:26.7733957Z D=5120, 2025-05-07T20:33:26.7734193Z scale_ub=None, 2025-05-07T20:33:26.7734403Z contiguous=False, 2025-05-07T20:33:26.7734630Z compiled=True, 2025-05-07T20:33:26.7734837Z ) 2025-05-07T20:33:26.8542347Z self = 2025-05-07T20:33:26.8543062Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:26.8543450Z 2025-05-07T20:33:26.8543558Z @given( 2025-05-07T20:33:26.8543870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.8544294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.8544699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.8545144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.8545575Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.8545870Z ) 2025-05-07T20:33:26.8546238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.8546699Z def test_silu_mul_quant( 2025-05-07T20:33:26.8546941Z self, 2025-05-07T20:33:26.8547132Z T: int, 2025-05-07T20:33:26.8547324Z D: int, 2025-05-07T20:33:26.8547532Z scale_ub: Optional[float], 2025-05-07T20:33:26.8547808Z contiguous: bool, 2025-05-07T20:33:26.8548048Z compiled: bool, 2025-05-07T20:33:26.8548268Z ) -> None: 2025-05-07T20:33:26.8548479Z torch.manual_seed(2025) 2025-05-07T20:33:26.8548720Z 2025-05-07T20:33:26.8548984Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.8549343Z 2025-05-07T20:33:26.8549536Z x_sign = torch.sign(x) 2025-05-07T20:33:26.8549970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.8550290Z x = x_sign * x_clamp 2025-05-07T20:33:26.8550528Z x0 = x[:, :D] 2025-05-07T20:33:26.8550740Z x1 = x[:, D:] 2025-05-07T20:33:26.8550946Z 2025-05-07T20:33:26.8551126Z if contiguous: 2025-05-07T20:33:26.8551361Z x0 = x0.contiguous() 2025-05-07T20:33:26.8551616Z x1 = x1.contiguous() 2025-05-07T20:33:26.8551869Z 2025-05-07T20:33:26.8552058Z if scale_ub is not None: 2025-05-07T20:33:26.8552525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.8552869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.8553211Z ) 2025-05-07T20:33:26.8553402Z else: 2025-05-07T20:33:26.8553603Z scale_ub_tensor = None 2025-05-07T20:33:26.8553861Z 2025-05-07T20:33:26.8554100Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.8554415Z op = silu_mul_quant 2025-05-07T20:33:26.8554669Z if compiled: 2025-05-07T20:33:26.8554917Z op = torch.compile(op) 2025-05-07T20:33:26.8555214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.8555499Z 2025-05-07T20:33:26.8555695Z y_fp8, y_scale = fn() 2025-05-07T20:33:26.8555980Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:26.8556281Z 2025-05-07T20:33:26.8556517Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.8556861Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:26.8557157Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:26.8557478Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:26.8557848Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.8558163Z 2025-05-07T20:33:26.8558439Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:26.8558643Z 2025-05-07T20:33:26.8558745Z moe/activation_test.py:126: 2025-05-07T20:33:26.8559118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.8559464Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:26.8559801Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:26.8560649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:26.8561533Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:26.8562113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.8562842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.8563572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:26.8564348Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.8565155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:26.8565960Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:26.8566736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:26.8567426Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:26.8568064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:26.8568613Z fn() 2025-05-07T20:33:26.8569137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:26.8569765Z self.fn.run( 2025-05-07T20:33:26.8570265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.8570827Z kernel = self.compile( 2025-05-07T20:33:26.8571403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.8572101Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.8572523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.8572772Z 2025-05-07T20:33:26.8572987Z self = 2025-05-07T20:33:26.8574212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.8575728Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58fe6803a0>} 2025-05-07T20:33:26.8577194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.8578349Z context = 2025-05-07T20:33:26.8578653Z 2025-05-07T20:33:26.8578823Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.8579368Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.8579865Z module_map=module_map) 2025-05-07T20:33:26.8580236Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.8580601Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:26.8580878Z E ^ 2025-05-07T20:33:26.8581417Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.8581940Z 2025-05-07T20:33:26.8582387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.8583399Z 2025-05-07T20:33:26.8583502Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.8583936Z self=, 2025-05-07T20:33:26.8584434Z T=1, 2025-05-07T20:33:26.8584617Z D=5120, 2025-05-07T20:33:26.8584810Z scale_ub=None, 2025-05-07T20:33:26.8585021Z contiguous=True, 2025-05-07T20:33:26.8585238Z compiled=False, 2025-05-07T20:33:26.8585445Z ) 2025-05-07T20:33:27.2160964Z self = 2025-05-07T20:33:27.2161725Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:27.2162101Z 2025-05-07T20:33:27.2162203Z @given( 2025-05-07T20:33:27.2162565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:27.2162976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:27.2163368Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:27.2163722Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:27.2164055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:27.2164351Z ) 2025-05-07T20:33:27.2164710Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:27.2165179Z def test_silu_mul_quant( 2025-05-07T20:33:27.2165424Z self, 2025-05-07T20:33:27.2165616Z T: int, 2025-05-07T20:33:27.2165814Z D: int, 2025-05-07T20:33:27.2166036Z scale_ub: Optional[float], 2025-05-07T20:33:27.2166311Z contiguous: bool, 2025-05-07T20:33:27.2166551Z compiled: bool, 2025-05-07T20:33:27.2166774Z ) -> None: 2025-05-07T20:33:27.2166995Z torch.manual_seed(2025) 2025-05-07T20:33:27.2167237Z 2025-05-07T20:33:27.2167506Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:27.2167866Z 2025-05-07T20:33:27.2168058Z x_sign = torch.sign(x) 2025-05-07T20:33:27.2168344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:27.2168666Z x = x_sign * x_clamp 2025-05-07T20:33:27.2168910Z x0 = x[:, :D] 2025-05-07T20:33:27.2169122Z x1 = x[:, D:] 2025-05-07T20:33:27.2169335Z 2025-05-07T20:33:27.2169522Z if contiguous: 2025-05-07T20:33:27.2169750Z x0 = x0.contiguous() 2025-05-07T20:33:27.2170013Z x1 = x1.contiguous() 2025-05-07T20:33:27.2170253Z 2025-05-07T20:33:27.2170736Z if scale_ub is not None: 2025-05-07T20:33:27.2171022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:27.2171365Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:27.2171683Z ) 2025-05-07T20:33:27.2171865Z else: 2025-05-07T20:33:27.2172081Z scale_ub_tensor = None 2025-05-07T20:33:27.2172339Z 2025-05-07T20:33:27.2172566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:27.2172889Z op = silu_mul_quant 2025-05-07T20:33:27.2173145Z if compiled: 2025-05-07T20:33:27.2173388Z op 
= torch.compile(op) 2025-05-07T20:33:27.2173690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2173976Z 2025-05-07T20:33:27.2174161Z > y_fp8, y_scale = fn() 2025-05-07T20:33:27.2174332Z 2025-05-07T20:33:27.2174429Z moe/activation_test.py:117: 2025-05-07T20:33:27.2174731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2175070Z moe/activation_test.py:115: in fn 2025-05-07T20:33:27.2175353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2176091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:27.2176928Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:27.2177492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:27.2178301Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:27.2179009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:27.2179657Z kernel = self.compile( 2025-05-07T20:33:27.2180223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:27.2180922Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:27.2181334Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2181575Z 2025-05-07T20:33:27.2181786Z self = 2025-05-07T20:33:27.2183257Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:27.2191350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe643940>} 2025-05-07T20:33:27.2192863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:27.2193988Z context = 2025-05-07T20:33:27.2194310Z 2025-05-07T20:33:27.2194486Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:27.2195058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:27.2195571Z module_map=module_map) 2025-05-07T20:33:27.2195957Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.2196340Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.2196613Z E ^ 2025-05-07T20:33:27.2197109Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.2197648Z 2025-05-07T20:33:27.2198119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.2198688Z 2025-05-07T20:33:27.2198900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.2199339Z self=, 2025-05-07T20:33:27.2199757Z T=128, 2025-05-07T20:33:27.2199951Z D=5120, 2025-05-07T20:33:27.2200152Z scale_ub=None, 2025-05-07T20:33:27.2200365Z contiguous=False, 2025-05-07T20:33:27.2200591Z compiled=True, 2025-05-07T20:33:27.2200804Z ) 2025-05-07T20:33:27.2201127Z self = 2025-05-07T20:33:27.2201650Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:27.2201938Z 2025-05-07T20:33:27.2202019Z @given( 2025-05-07T20:33:27.2202249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:27.2202559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:27.2202878Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:27.2203224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:27.2203564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:27.2203864Z ) 2025-05-07T20:33:27.2204236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:27.2204701Z def test_silu_mul_quant( 2025-05-07T20:33:27.2204950Z self, 2025-05-07T20:33:27.2205150Z T: int, 2025-05-07T20:33:27.2205416Z D: int, 2025-05-07T20:33:27.2205645Z scale_ub: Optional[float], 2025-05-07T20:33:27.2205995Z contiguous: bool, 2025-05-07T20:33:27.2206242Z compiled: bool, 2025-05-07T20:33:27.2206462Z ) -> None: 2025-05-07T20:33:27.2206680Z torch.manual_seed(2025) 2025-05-07T20:33:27.2206929Z 2025-05-07T20:33:27.2207206Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:27.2207622Z 2025-05-07T20:33:27.2207824Z x_sign = torch.sign(x) 2025-05-07T20:33:27.2208119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:27.2208429Z x = x_sign * x_clamp 2025-05-07T20:33:27.2208674Z x0 = x[:, :D] 2025-05-07T20:33:27.2208890Z x1 = x[:, D:] 2025-05-07T20:33:27.2209093Z 2025-05-07T20:33:27.2209277Z if contiguous: 2025-05-07T20:33:27.2209507Z x0 = x0.contiguous() 2025-05-07T20:33:27.2209763Z x1 = x1.contiguous() 2025-05-07T20:33:27.2210011Z 2025-05-07T20:33:27.2210211Z if scale_ub is not None: 2025-05-07T20:33:27.2210484Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:27.2210825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:27.2211146Z ) 2025-05-07T20:33:27.2211328Z else: 2025-05-07T20:33:27.2211543Z scale_ub_tensor = None 2025-05-07T20:33:27.2211800Z 2025-05-07T20:33:27.2212031Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:27.2212353Z op = silu_mul_quant 2025-05-07T20:33:27.2212606Z if compiled: 2025-05-07T20:33:27.2212857Z op = torch.compile(op) 2025-05-07T20:33:27.2213155Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2213436Z 2025-05-07T20:33:27.2213626Z > y_fp8, y_scale = fn() 2025-05-07T20:33:27.2213790Z 2025-05-07T20:33:27.2213885Z moe/activation_test.py:117: 2025-05-07T20:33:27.2214192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2214535Z moe/activation_test.py:115: in fn 2025-05-07T20:33:27.2214815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2215405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:27.2216003Z return fn(*args, **kwargs) 
2025-05-07T20:33:27.2216711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:27.2217450Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:27.2218072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:27.2218805Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:27.2219512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:27.2220074Z kernel = self.compile( 2025-05-07T20:33:27.2220647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:27.2221347Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:27.2221756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2222003Z 2025-05-07T20:33:27.2222214Z self = 2025-05-07T20:33:27.2223390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:27.2224893Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe4fc040>} 2025-05-07T20:33:27.2226418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:27.2227593Z context = 2025-05-07T20:33:27.2227904Z 2025-05-07T20:33:27.2228074Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:27.2228623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:27.2229738Z module_map=module_map) 2025-05-07T20:33:27.2230244Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.2230604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.2230865Z E ^ 2025-05-07T20:33:27.2231349Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.2231842Z 2025-05-07T20:33:27.2232290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.2232853Z 2025-05-07T20:33:27.2232952Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.2233379Z self=, 2025-05-07T20:33:27.2233794Z T=128, 2025-05-07T20:33:27.2233981Z D=7168, 2025-05-07T20:33:27.2234171Z scale_ub=1200.0, 2025-05-07T20:33:27.2234392Z contiguous=False, 2025-05-07T20:33:27.2234620Z compiled=False, 2025-05-07T20:33:27.2234824Z ) 2025-05-07T20:33:27.3795251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.3795614Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.3795883Z E ^ 2025-05-07T20:33:27.3796366Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.3796864Z 2025-05-07T20:33:27.3797312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
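Note on the failure mode above: every example aborts at Triton compile time, before the kernel ever runs, because fp8e4nv (the FP8 E4M3 format) cannot be lowered on this GPU. fp8e4nv needs an Ada- or Hopper-class part (CUDA compute capability 8.9 or newer), while the A10G in a linux.g5.4xlarge runner reports capability 8.6, which is why the ValueError lists only fp8e4b15 and fp8e5 as supported. A minimal pre-flight check along these lines would confirm this up front (a sketch; supports_fp8e4nv is a hypothetical helper, not part of this repo):

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton only lowers fp8e4nv (E4M3) on NVIDIA GPUs with compute
        # capability >= 8.9 (Ada/Hopper); the A10G on this runner is (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)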
2025-05-07T20:33:27.3797869Z 2025-05-07T20:33:27.3797983Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.3798401Z self=, 2025-05-07T20:33:27.3798825Z T=128, 2025-05-07T20:33:27.3799011Z D=5120, 2025-05-07T20:33:27.3799202Z scale_ub=None, 2025-05-07T20:33:27.3799412Z contiguous=False, 2025-05-07T20:33:27.3799641Z compiled=False, 2025-05-07T20:33:27.3799851Z ) 2025-05-07T20:33:27.3827389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.3827754Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.3828019Z E ^ 2025-05-07T20:33:27.3828513Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.3829012Z 2025-05-07T20:33:27.3829462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.3830157Z 2025-05-07T20:33:27.3830266Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.3830684Z self=, 2025-05-07T20:33:27.3831106Z T=128, 2025-05-07T20:33:27.3831293Z D=5120, 2025-05-07T20:33:27.3831476Z scale_ub=1200.0, 2025-05-07T20:33:27.3831702Z contiguous=True, 2025-05-07T20:33:27.3831930Z compiled=False, 2025-05-07T20:33:27.3832129Z ) 2025-05-07T20:33:27.6157191Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.6157557Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.6157815Z E ^ 2025-05-07T20:33:27.6158307Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.6158801Z 2025-05-07T20:33:27.6159296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.6159890Z
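Hypothesis keeps drawing fresh examples, but the failure is architecture-dependent rather than input-dependent, so every draw fails identically. One way to keep such runners green is to skip the FP8 cases at collection time; a sketch under the assumption that the suite is a plain unittest.TestCase (the class name here is illustrative, and FBGEMM may already gate this differently):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Same capability guard as sketched above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test class, this skips every FP8 case on pre-8.9 GPUs.
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class ActivationTests(unittest.TestCase):  # hypothetical class name
        ...  # test_silu_mul_quant and friends as in moe/activation_test.py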
2025-05-07T20:33:27.6160003Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.6160422Z self=, 2025-05-07T20:33:27.6160838Z T=1, 2025-05-07T20:33:27.6161021Z D=7168, 2025-05-07T20:33:27.6161208Z scale_ub=1200.0, 2025-05-07T20:33:27.6161476Z contiguous=True, 2025-05-07T20:33:27.6161706Z compiled=True, 2025-05-07T20:33:27.6161910Z ) 2025-05-07T20:33:27.6198084Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.6198448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.6198707Z E ^ 2025-05-07T20:33:27.6199203Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.6199695Z 2025-05-07T20:33:27.6200150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.6200703Z 2025-05-07T20:33:27.6200876Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.6201304Z self=, 2025-05-07T20:33:27.6201732Z T=1, 2025-05-07T20:33:27.6201915Z D=7168, 2025-05-07T20:33:27.6202099Z scale_ub=1200.0, 2025-05-07T20:33:27.6202324Z contiguous=False, 2025-05-07T20:33:27.6202555Z compiled=True, 2025-05-07T20:33:27.6202756Z ) 2025-05-07T20:33:27.9586047Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.9586404Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.9586666Z E ^ 2025-05-07T20:33:27.9587160Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.9587649Z 2025-05-07T20:33:27.9588098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.9588666Z 2025-05-07T20:33:27.9588771Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.9589206Z self=, 2025-05-07T20:33:27.9589629Z T=1, 2025-05-07T20:33:27.9589968Z D=7168, 2025-05-07T20:33:27.9590164Z scale_ub=None, 2025-05-07T20:33:27.9590378Z contiguous=False, 2025-05-07T20:33:27.9590596Z compiled=True, 2025-05-07T20:33:27.9590815Z ) 2025-05-07T20:33:28.0723435Z self = 2025-05-07T20:33:28.0724098Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.0724540Z 2025-05-07T20:33:28.0724656Z @given( 2025-05-07T20:33:28.0724965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.0725422Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.0725740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.0726088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.0726425Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.0726724Z ) 2025-05-07T20:33:28.0727077Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.0727543Z def test_silu_mul_quant( 2025-05-07T20:33:28.0727791Z self, 2025-05-07T20:33:28.0727978Z T: int, 2025-05-07T20:33:28.0728174Z D: int, 2025-05-07T20:33:28.0728394Z scale_ub: Optional[float], 2025-05-07T20:33:28.0728888Z contiguous: bool, 2025-05-07T20:33:28.0729139Z compiled: bool, 2025-05-07T20:33:28.0729374Z ) -> None: 2025-05-07T20:33:28.0729593Z torch.manual_seed(2025) 2025-05-07T20:33:28.0729830Z 2025-05-07T20:33:28.0730107Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.0730467Z 2025-05-07T20:33:28.0730656Z x_sign = torch.sign(x) 2025-05-07T20:33:28.0730955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.0731275Z x = x_sign * x_clamp 2025-05-07T20:33:28.0731510Z x0 = x[:, :D] 2025-05-07T20:33:28.0731730Z x1 = x[:, D:] 2025-05-07T20:33:28.0731940Z 2025-05-07T20:33:28.0732117Z if contiguous: 2025-05-07T20:33:28.0732356Z x0 = x0.contiguous() 2025-05-07T20:33:28.0732627Z x1 = x1.contiguous() 2025-05-07T20:33:28.0732867Z 2025-05-07T20:33:28.0733063Z if scale_ub is not None: 2025-05-07T20:33:28.0733350Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.0733687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.0734010Z ) 2025-05-07T20:33:28.0734202Z else: 2025-05-07T20:33:28.0734406Z scale_ub_tensor = None 2025-05-07T20:33:28.0734679Z 2025-05-07T20:33:28.0735020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.0735351Z op = silu_mul_quant 2025-05-07T20:33:28.0735712Z if compiled: 2025-05-07T20:33:28.0735961Z op = torch.compile(op) 2025-05-07T20:33:28.0736267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.0736544Z 2025-05-07T20:33:28.0736737Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.0737027Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.0737397Z 2025-05-07T20:33:28.0737628Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.0737975Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.0738282Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.0738602Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.0738978Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.0739304Z 2025-05-07T20:33:28.0739500Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:28.0739712Z 2025-05-07T20:33:28.0739812Z moe/activation_test.py:126: 2025-05-07T20:33:28.0740121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.0740471Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.0740802Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.0741648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.0742464Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.0743038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.0743768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.0744504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.0745277Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.0746084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:28.0746890Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.0747673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.0748360Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.0749047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.0749604Z fn() 2025-05-07T20:33:28.0750281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.0750905Z self.fn.run( 2025-05-07T20:33:28.0751407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.0751980Z kernel = self.compile( 2025-05-07T20:33:28.0752556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.0753258Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.0753681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.0753930Z 2025-05-07T20:33:28.0754156Z self = 2025-05-07T20:33:28.0755333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.0756910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58fe034160>} 2025-05-07T20:33:28.0758466Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.0759574Z context = 2025-05-07T20:33:28.0759924Z 2025-05-07T20:33:28.0760101Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.0760646Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.0761142Z module_map=module_map) 2025-05-07T20:33:28.0761518Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.0761888Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.0762155Z E ^ 2025-05-07T20:33:28.0762656Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.0763146Z 2025-05-07T20:33:28.0763600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.0764154Z 2025-05-07T20:33:28.0764258Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.0764687Z self=, 2025-05-07T20:33:28.0765115Z T=1, 2025-05-07T20:33:28.0765302Z D=5120, 2025-05-07T20:33:28.0765490Z scale_ub=1200.0, 2025-05-07T20:33:28.0765718Z contiguous=False, 2025-05-07T20:33:28.0765946Z compiled=True, 2025-05-07T20:33:28.0766147Z ) 2025-05-07T20:33:28.2791036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.2791408Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.2791682Z E ^ 2025-05-07T20:33:28.2792179Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.2792677Z 2025-05-07T20:33:28.2793130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
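Two different kernels hit the same wall in this log: the op under test (_fbgemm_silu_mul_quant) and, in the T=1/D=7168 example above, the reference path's _kernel_quantize_fp8_row reached through triton_quantize_fp8_row and the autotuner. The comparison therefore cannot be rescued by the reference side alone; both sides need fp8e4nv. If a reference were still wanted on such GPUs, a pure-PyTorch stand-in is possible; the scaling scheme below (row max mapped to the E4M3 finite max of 448) is an assumption for illustration, not FBGEMM's exact kernel semantics, and it needs a torch build with float8 storage dtypes:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_torch(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so the row max lands at E4M3's finite max.
        row_max = x.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / 448.0
        y = (x.to(torch.float32) / scale).clamp(-448.0, 448.0)
        # float8_e4m3fn is used purely as a storage dtype here; no Triton
        # compilation (and hence no SM 8.9 requirement) is involved.
        return y.to(torch.float8_e4m3fn), scale.squeeze(-1)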
2025-05-07T20:33:28.2793696Z 2025-05-07T20:33:28.2793801Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.2794235Z self=, 2025-05-07T20:33:28.2794657Z T=1, 2025-05-07T20:33:28.2794914Z D=5120, 2025-05-07T20:33:28.2795111Z scale_ub=1200.0, 2025-05-07T20:33:28.2795390Z contiguous=False, 2025-05-07T20:33:28.2795617Z compiled=False, 2025-05-07T20:33:28.2795828Z ) 2025-05-07T20:33:28.2830645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.2831016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.2831286Z E ^ 2025-05-07T20:33:28.2831777Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.2832273Z 2025-05-07T20:33:28.2832725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.2833296Z
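Every combination tried in this stretch of the log (across T, D, scale_ub, contiguous, and compiled) fails with the same CompilationError, which points at the environment rather than the inputs. For local debugging there is no need to replay the whole verbose search; a single case can be pinned with Hypothesis's example decorator (a sketch mirroring the decorator stack shown above; max_examples=1 stands in for the suite's _MAX_SAMPLES):

    from hypothesis import Verbosity, example, given, settings
    import hypothesis.strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    # Pins the first failing case from this log so it replays deterministically.
    @example(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
    @settings(verbosity=Verbosity.verbose, max_examples=1, deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body as in moe/activation_test.py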
2025-05-07T20:33:28.4023896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.4024640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.4025202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.4025930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.4026642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.4027212Z kernel = self.compile( 2025-05-07T20:33:28.4027774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.4028527Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.4028943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4029181Z 2025-05-07T20:33:28.4029401Z self = 2025-05-07T20:33:28.4030704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.4032213Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe22a1f0>} 2025-05-07T20:33:28.4033687Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.4034800Z context = 2025-05-07T20:33:28.4035106Z 2025-05-07T20:33:28.4035277Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.4035882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.4036372Z module_map=module_map) 2025-05-07T20:33:28.4036787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.4037143Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.4037408Z E ^ 2025-05-07T20:33:28.4037902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:28.4038431Z
2025-05-07T20:33:28.4038881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:28.4039444Z
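[Note: the root cause is the same for every example in this run. Triton's NVIDIA backend only accepts the fp8e4nv dtype (FP8 E4M3) on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of a capability guard follows; the helper name supports_fp8e4nv and the skip wiring are illustrative assumptions, not code from the FBGEMM test suite:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv maps to CUDA FP8 E4M3, which Triton compiles
    # only for compute capability (8, 9) and newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class Fp8GuardedTests(unittest.TestCase):
    def test_capability_guard(self) -> None:
        # Only runs on hardware where the Triton FP8 kernel can compile.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))

With a guard of this shape the whole test class is skipped once, up front, instead of emitting one identical CompilationError per Hypothesis example.]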
[Hypothesis went on to try ten more examples; every one failed at the same kernel launch with the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The repeated test body and traceback are elided; only the parameter combinations tried are kept:
  test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
  test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
  test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
  test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
  test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
  test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)]
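[Note: for readers without the FBGEMM sources at hand, the operator under test fuses a SiLU-gated multiply with FP8 quantization: from the call site, silu_mul_quant(x0, x1, scale_ub_tensor) returns a (y_fp8, y_scale) pair. A rough eager-mode reference of the intended semantics is sketched below, assuming row-wise dynamic scaling and that scale_ub caps the per-row maximum before the scale is derived; the real kernel's scale layout and clamping details may differ:

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics; not FBGEMM's implementation.
    # y = silu(x0) * x1, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Dynamic per-row scale; the clamp avoids dividing by zero on all-zero rows.
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)

Dequantizing with y_fp8.float() * y_scale.unsqueeze(1) should then approximate y, which is the kind of round-trip check a test like this would assert once compilation succeeds.]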
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:29.6714272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
[The next 11 Hypothesis examples fail with the identical CompilationError and traceback; their duplicated source listings and tracebacks are collapsed to the drawn parameters below.]
2025-05-07T20:33:29.6715346Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:29.6766746Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:29.9694766Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:30.1807197Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:30.1845360Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:30.4116594Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:30.4151702Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:30.5401585Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:30.9424430Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:30.9458551Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:31.0820010Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

The following examples fail with the identical CompilationError (ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")); the repeated test source and traceback are not shown again:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
The next examples fail earlier, while the test builds its inputs, each with torch.OutOfMemoryError (the repeated test source is elided; the allocator statistics differ per example and are kept):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95 (x_clamp): torch.OutOfMemoryError: Tried to allocate 112.00 MiB (28.44 MiB free; process 22.03 GiB; 21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): torch.OutOfMemoryError: Tried to allocate 448.00 MiB (140.44 MiB free; process 21.92 GiB; 21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95 (x_clamp): torch.OutOfMemoryError: Tried to allocate 56.00 MiB (28.44 MiB free; process 22.03 GiB; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:94 (x_sign = torch.sign(x)): torch.OutOfMemoryError: Tried to allocate 56.00 MiB (28.44 MiB free; process 22.03 GiB; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
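These out-of-memory failures are cumulative rather than per-example: each example materializes a [T, 2 * D] bfloat16 input plus several temporaries of the same size (at T=16384, D=7168 the input alone is 16384 x 14336 x 2 bytes = 448 MiB), and by this point the process already holds roughly 21.9 of the card's 22.07 GiB, so even a 56 MiB request fails. Below is a minimal sketch of per-example cleanup, assuming tensors from the previous example are no longer referenced; _reset_cuda_pool is a hypothetical helper, not part of the test suite:

import gc

import torch


def _reset_cuda_pool() -> None:
    # Drop leftover references from the previous Hypothesis example, then
    # return the allocator's cached-but-unused blocks to the driver so the
    # next example starts from a near-empty pool.
    gc.collect()
    torch.cuda.empty_cache()

The error text's own suggestion is the complementary fix: launching the suite with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True makes the allocator less prone to fragmentation when examples of many different sizes share one process.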
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.7788556Z 2025-05-07T20:33:31.7788674Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:31.7788895Z 2025-05-07T20:33:31.7789004Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.7789427Z self=, 2025-05-07T20:33:31.7789930Z T=1, 2025-05-07T20:33:31.7790125Z D=7168, 2025-05-07T20:33:31.7790325Z scale_ub=1200.0, 2025-05-07T20:33:31.7790548Z contiguous=True, 2025-05-07T20:33:31.7790779Z compiled=False, 2025-05-07T20:33:31.7790991Z ) 2025-05-07T20:33:32.1129772Z self = 2025-05-07T20:33:32.1130398Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.1130689Z 2025-05-07T20:33:32.1130774Z @given( 2025-05-07T20:33:32.1131312Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.1131634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.1131947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.1132291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.1132630Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.1132937Z ) 2025-05-07T20:33:32.1133306Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.1133765Z def test_silu_mul_quant( 2025-05-07T20:33:32.1134111Z self, 2025-05-07T20:33:32.1134304Z T: int, 2025-05-07T20:33:32.1134500Z D: int, 2025-05-07T20:33:32.1134721Z scale_ub: Optional[float], 2025-05-07T20:33:32.1134997Z contiguous: bool, 2025-05-07T20:33:32.1135234Z compiled: bool, 2025-05-07T20:33:32.1135470Z ) -> None: 2025-05-07T20:33:32.1135685Z torch.manual_seed(2025) 2025-05-07T20:33:32.1135927Z 2025-05-07T20:33:32.1136209Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.1136565Z 2025-05-07T20:33:32.1136757Z x_sign = torch.sign(x) 2025-05-07T20:33:32.1137048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.1137366Z x = x_sign * x_clamp 2025-05-07T20:33:32.1137713Z x0 = x[:, :D] 2025-05-07T20:33:32.1137928Z x1 = x[:, D:] 2025-05-07T20:33:32.1138141Z 2025-05-07T20:33:32.1138329Z if contiguous: 2025-05-07T20:33:32.1138557Z x0 = x0.contiguous() 2025-05-07T20:33:32.1138823Z x1 = x1.contiguous() 2025-05-07T20:33:32.1139073Z 2025-05-07T20:33:32.1139262Z if scale_ub is not None: 2025-05-07T20:33:32.1139541Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.1139971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.1140286Z ) 2025-05-07T20:33:32.1140482Z else: 2025-05-07T20:33:32.1140695Z scale_ub_tensor = None 2025-05-07T20:33:32.1140944Z 2025-05-07T20:33:32.1141177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.1141504Z op = silu_mul_quant 2025-05-07T20:33:32.1141761Z if compiled: 2025-05-07T20:33:32.1142005Z op = torch.compile(op) 2025-05-07T20:33:32.1142311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.1142594Z 2025-05-07T20:33:32.1142781Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.1142959Z 2025-05-07T20:33:32.1143057Z moe/activation_test.py:117: 2025-05-07T20:33:32.1143363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.1143707Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.1143994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.1144741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.1145491Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.1146054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.1146790Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.1147505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.1148071Z kernel = self.compile( 2025-05-07T20:33:32.1148644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.1149346Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.1149938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.1150184Z 2025-05-07T20:33:32.1150395Z self = 2025-05-07T20:33:32.1151614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.1153140Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd197040>} 2025-05-07T20:33:32.1154603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.1155760Z context = 2025-05-07T20:33:32.1156064Z 2025-05-07T20:33:32.1156237Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.1156785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.1157283Z module_map=module_map) 2025-05-07T20:33:32.1157653Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.1158016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.1158283Z E ^ 2025-05-07T20:33:32.1158820Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.1159311Z 2025-05-07T20:33:32.1159758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.1160322Z 2025-05-07T20:33:32.1160422Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.1160848Z self=, 2025-05-07T20:33:32.1161305Z T=128, 2025-05-07T20:33:32.1161495Z D=5120, 2025-05-07T20:33:32.1161687Z scale_ub=None, 2025-05-07T20:33:32.1161896Z contiguous=True, 2025-05-07T20:33:32.1162114Z compiled=False, 2025-05-07T20:33:32.1162322Z ) 2025-05-07T20:33:32.1162647Z self = 2025-05-07T20:33:32.1163152Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.1163437Z 2025-05-07T20:33:32.1163511Z @given( 2025-05-07T20:33:32.1163744Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.1164063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.1164385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.1164729Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.1165059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.1165358Z ) 2025-05-07T20:33:32.1165720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.1166188Z def test_silu_mul_quant( 2025-05-07T20:33:32.1166425Z self, 2025-05-07T20:33:32.1166617Z T: int, 2025-05-07T20:33:32.1166813Z D: int, 2025-05-07T20:33:32.1167029Z scale_ub: Optional[float], 2025-05-07T20:33:32.1167303Z contiguous: bool, 2025-05-07T20:33:32.1167545Z compiled: bool, 2025-05-07T20:33:32.1167763Z ) -> None: 2025-05-07T20:33:32.1167980Z torch.manual_seed(2025) 2025-05-07T20:33:32.1168225Z 2025-05-07T20:33:32.1168493Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.1168851Z 2025-05-07T20:33:32.1169044Z x_sign = torch.sign(x) 2025-05-07T20:33:32.1169333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.1169658Z x = x_sign * x_clamp 2025-05-07T20:33:32.1169901Z x0 = x[:, :D] 2025-05-07T20:33:32.1170113Z x1 = x[:, D:] 2025-05-07T20:33:32.1170324Z 2025-05-07T20:33:32.1170512Z if contiguous: 2025-05-07T20:33:32.1170739Z x0 = x0.contiguous() 2025-05-07T20:33:32.1170998Z x1 = x1.contiguous() 2025-05-07T20:33:32.1171238Z 2025-05-07T20:33:32.1171479Z if scale_ub is not None: 2025-05-07T20:33:32.1171752Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.1172093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.1172410Z ) 2025-05-07T20:33:32.1172594Z else: 2025-05-07T20:33:32.1172798Z scale_ub_tensor = None 2025-05-07T20:33:32.1173055Z 2025-05-07T20:33:32.1173278Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.1173602Z op = silu_mul_quant 2025-05-07T20:33:32.1173911Z if compiled: 2025-05-07T20:33:32.1174153Z op = torch.compile(op) 2025-05-07T20:33:32.1174453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.1174734Z 2025-05-07T20:33:32.1174920Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.1175099Z 2025-05-07T20:33:32.1175197Z moe/activation_test.py:117: 2025-05-07T20:33:32.1175499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.1175846Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.1176123Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.1176861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.1177651Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.1178214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.1178944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.1179660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.1180225Z kernel = self.compile( 2025-05-07T20:33:32.1180830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.1181535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.1181952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.1182193Z 2025-05-07T20:33:32.1182413Z self = 2025-05-07T20:33:32.1183929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.1185448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd197a60>} 2025-05-07T20:33:32.1186917Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.1188035Z context = 2025-05-07T20:33:32.1188346Z 2025-05-07T20:33:32.1188520Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.1189075Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.1189571Z module_map=module_map) 2025-05-07T20:33:32.1190045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.1190404Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.1190677Z E ^ 2025-05-07T20:33:32.1191174Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.1191661Z 2025-05-07T20:33:32.1192110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.1192676Z 2025-05-07T20:33:32.1192776Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.1193298Z self=, 2025-05-07T20:33:32.1193735Z T=128, 2025-05-07T20:33:32.1200909Z D=7168, 2025-05-07T20:33:32.1201127Z scale_ub=None, 2025-05-07T20:33:32.1201348Z contiguous=True, 2025-05-07T20:33:32.1201589Z compiled=False, 2025-05-07T20:33:32.1201844Z ) 2025-05-07T20:33:32.2100627Z self = 2025-05-07T20:33:32.2101346Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.2101934Z 2025-05-07T20:33:32.2102017Z @given( 2025-05-07T20:33:32.2102267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2102601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2102926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2103279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2103627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2103927Z ) 2025-05-07T20:33:32.2104296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2104772Z def test_silu_mul_quant( 2025-05-07T20:33:32.2105024Z self, 2025-05-07T20:33:32.2105214Z T: int, 2025-05-07T20:33:32.2105417Z D: int, 2025-05-07T20:33:32.2105736Z scale_ub: Optional[float], 2025-05-07T20:33:32.2106013Z contiguous: bool, 2025-05-07T20:33:32.2106264Z compiled: bool, 2025-05-07T20:33:32.2106501Z ) -> None: 2025-05-07T20:33:32.2106719Z torch.manual_seed(2025) 2025-05-07T20:33:32.2106972Z 2025-05-07T20:33:32.2107252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2107608Z 2025-05-07T20:33:32.2107890Z x_sign = torch.sign(x) 2025-05-07T20:33:32.2108189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.2108503Z x = x_sign * x_clamp 2025-05-07T20:33:32.2108750Z x0 = x[:, :D] 2025-05-07T20:33:32.2108969Z x1 = x[:, D:] 2025-05-07T20:33:32.2109173Z 2025-05-07T20:33:32.2109359Z if contiguous: 2025-05-07T20:33:32.2109591Z x0 = x0.contiguous() 2025-05-07T20:33:32.2109980Z x1 = x1.contiguous() 2025-05-07T20:33:32.2110225Z 2025-05-07T20:33:32.2110418Z if scale_ub is not None: 2025-05-07T20:33:32.2110696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.2111034Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.2111359Z ) 2025-05-07T20:33:32.2111553Z else: 2025-05-07T20:33:32.2111758Z scale_ub_tensor = None 2025-05-07T20:33:32.2112016Z 2025-05-07T20:33:32.2112246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.2112567Z op = silu_mul_quant 2025-05-07T20:33:32.2112817Z if compiled: 2025-05-07T20:33:32.2113066Z op = torch.compile(op) 2025-05-07T20:33:32.2113366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.2113649Z 2025-05-07T20:33:32.2113844Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.2114010Z 2025-05-07T20:33:32.2114108Z moe/activation_test.py:117: 2025-05-07T20:33:32.2114411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.2114759Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.2115046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.2115782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.2116532Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.2117105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.2117836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.2118642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.2119213Z kernel = self.compile( 2025-05-07T20:33:32.2119792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.2120489Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.2120906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.2121147Z 2025-05-07T20:33:32.2121367Z self = 2025-05-07T20:33:32.2122591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.2124107Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd153790>} 2025-05-07T20:33:32.2125571Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.2126721Z context = 2025-05-07T20:33:32.2127028Z 2025-05-07T20:33:32.2127204Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.2127747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.2128241Z module_map=module_map) 2025-05-07T20:33:32.2128622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.2129026Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.2129285Z E ^ 2025-05-07T20:33:32.2129782Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.2130272Z 2025-05-07T20:33:32.2130726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.2131283Z 2025-05-07T20:33:32.2131393Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2131815Z self=, 2025-05-07T20:33:32.2132235Z T=2048, 2025-05-07T20:33:32.2132420Z D=7168, 2025-05-07T20:33:32.2132608Z scale_ub=1200.0, 2025-05-07T20:33:32.2132835Z contiguous=True, 2025-05-07T20:33:32.2133058Z compiled=False, 2025-05-07T20:33:32.2133262Z ) 2025-05-07T20:33:32.2133584Z self = 2025-05-07T20:33:32.2134109Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.2134399Z 2025-05-07T20:33:32.2134475Z @given( 2025-05-07T20:33:32.2134708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2135029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2135345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2135676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2136014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2136308Z ) 2025-05-07T20:33:32.2136667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2137134Z def test_silu_mul_quant( 2025-05-07T20:33:32.2137379Z self, 2025-05-07T20:33:32.2137566Z T: int, 2025-05-07T20:33:32.2137765Z D: int, 2025-05-07T20:33:32.2137986Z scale_ub: Optional[float], 2025-05-07T20:33:32.2138255Z contiguous: bool, 2025-05-07T20:33:32.2138503Z compiled: bool, 2025-05-07T20:33:32.2138751Z ) -> None: 2025-05-07T20:33:32.2138986Z torch.manual_seed(2025) 2025-05-07T20:33:32.2139228Z 2025-05-07T20:33:32.2139592Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2141857Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
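The CompilationError above is the first of many identical ones in this run: Triton's fp8e4nv (FP8 E4M3) dtype is not available on this runner's GPU. A g5.4xlarge carries an NVIDIA A10G (compute capability 8.6), and Triton's fp8e4nv lowering generally requires compute capability 8.9 or newer, which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a capability guard that would skip these cases up front, assuming plain unittest and a hypothetical helper name (this is not FBGEMM's actual test code):

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv needs sm_89+ (Ada/Hopper); the A10G on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9")
class ActivationFp8Tests(unittest.TestCase):
    ...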
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.2143969Z 2025-05-07T20:33:32.2144087Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.2144310Z 2025-05-07T20:33:32.2144425Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2144849Z self=, 2025-05-07T20:33:32.2145273Z T=1, 2025-05-07T20:33:32.2145462Z D=5120, 2025-05-07T20:33:32.2145646Z scale_ub=1200.0, 2025-05-07T20:33:32.2145870Z contiguous=True, 2025-05-07T20:33:32.2146093Z compiled=False, 2025-05-07T20:33:32.2146292Z ) 2025-05-07T20:33:32.2639343Z self = 2025-05-07T20:33:32.2640063Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.2640353Z 2025-05-07T20:33:32.2640431Z @given( 2025-05-07T20:33:32.2640662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2640987Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2641305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2641645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2642060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2642354Z ) 2025-05-07T20:33:32.2642719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2643179Z def test_silu_mul_quant( 2025-05-07T20:33:32.2643422Z self, 2025-05-07T20:33:32.2643618Z T: int, 2025-05-07T20:33:32.2643812Z D: int, 2025-05-07T20:33:32.2644034Z scale_ub: Optional[float], 2025-05-07T20:33:32.2644315Z contiguous: bool, 2025-05-07T20:33:32.2644563Z compiled: bool, 2025-05-07T20:33:32.2644785Z ) -> None: 2025-05-07T20:33:32.2645002Z torch.manual_seed(2025) 2025-05-07T20:33:32.2645253Z 2025-05-07T20:33:32.2645523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2645885Z 2025-05-07T20:33:32.2646079Z x_sign = torch.sign(x) 2025-05-07T20:33:32.2646369Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.2646698Z x = x_sign * x_clamp 2025-05-07T20:33:32.2646945Z x0 = x[:, :D] 2025-05-07T20:33:32.2647163Z x1 = x[:, D:] 2025-05-07T20:33:32.2647374Z 2025-05-07T20:33:32.2647564Z if contiguous: 2025-05-07T20:33:32.2647798Z x0 = x0.contiguous() 2025-05-07T20:33:32.2648066Z x1 = x1.contiguous() 2025-05-07T20:33:32.2648317Z 2025-05-07T20:33:32.2648507Z if scale_ub is not None: 2025-05-07T20:33:32.2648792Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.2649144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.2649468Z ) 2025-05-07T20:33:32.2649659Z else: 2025-05-07T20:33:32.2649874Z scale_ub_tensor = None 2025-05-07T20:33:32.2650136Z 2025-05-07T20:33:32.2650366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.2650696Z op = silu_mul_quant 2025-05-07T20:33:32.2650954Z if compiled: 2025-05-07T20:33:32.2651203Z op = torch.compile(op) 2025-05-07T20:33:32.2651511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.2651804Z 2025-05-07T20:33:32.2651995Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.2652254Z 2025-05-07T20:33:32.2652355Z moe/activation_test.py:117: 2025-05-07T20:33:32.2652664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.2653003Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.2653292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.2654039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.2654801Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.2655463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.2656200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.2656912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.2657486Z kernel = self.compile( 2025-05-07T20:33:32.2658065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.2658771Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.2659180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.2659480Z 2025-05-07T20:33:32.2659696Z self = 2025-05-07T20:33:32.2660869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.2662393Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd218040>} 2025-05-07T20:33:32.2663905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.2665007Z context = 2025-05-07T20:33:32.2665322Z 2025-05-07T20:33:32.2665494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.2666042Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.2666532Z module_map=module_map) 2025-05-07T20:33:32.2666915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.2667282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.2667548Z E ^ 2025-05-07T20:33:32.2668035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.2668528Z 2025-05-07T20:33:32.2668979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.2669534Z 2025-05-07T20:33:32.2669642Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2670158Z self=, 2025-05-07T20:33:32.2670577Z T=2048, 2025-05-07T20:33:32.2670768Z D=5120, 2025-05-07T20:33:32.2670963Z scale_ub=None, 2025-05-07T20:33:32.2671170Z contiguous=True, 2025-05-07T20:33:32.2671394Z compiled=False, 2025-05-07T20:33:32.2671607Z ) 2025-05-07T20:33:32.2671929Z self = 2025-05-07T20:33:32.2672455Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.2672741Z 2025-05-07T20:33:32.2672825Z @given( 2025-05-07T20:33:32.2673051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2673377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2673754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2674101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2674436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2674734Z ) 2025-05-07T20:33:32.2675095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2675555Z def test_silu_mul_quant( 2025-05-07T20:33:32.2675800Z self, 2025-05-07T20:33:32.2676001Z T: int, 2025-05-07T20:33:32.2676194Z D: int, 2025-05-07T20:33:32.2676470Z scale_ub: Optional[float], 2025-05-07T20:33:32.2676747Z contiguous: bool, 2025-05-07T20:33:32.2676985Z compiled: bool, 2025-05-07T20:33:32.2677211Z ) -> None: 2025-05-07T20:33:32.2677431Z torch.manual_seed(2025) 2025-05-07T20:33:32.2677678Z 2025-05-07T20:33:32.2677958Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2678318Z 2025-05-07T20:33:32.2678514Z > x_sign = torch.sign(x) 2025-05-07T20:33:32.2680771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.2683072Z 2025-05-07T20:33:32.2683193Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:32.2683421Z 2025-05-07T20:33:32.2683524Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2684026Z self=, 2025-05-07T20:33:32.2684443Z T=16384, 2025-05-07T20:33:32.2684642Z D=5120, 2025-05-07T20:33:32.2684840Z scale_ub=None, 2025-05-07T20:33:32.2685051Z contiguous=True, 2025-05-07T20:33:32.2685275Z compiled=False, 2025-05-07T20:33:32.2685482Z ) 2025-05-07T20:33:32.2685804Z self = 2025-05-07T20:33:32.2686337Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.2686641Z 2025-05-07T20:33:32.2686715Z @given( 2025-05-07T20:33:32.2686953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.2687279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.2687595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.2687935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.2688270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.2688568Z ) 2025-05-07T20:33:32.2688929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.2689396Z def test_silu_mul_quant( 2025-05-07T20:33:32.2689635Z self, 2025-05-07T20:33:32.2689830Z T: int, 2025-05-07T20:33:32.2690029Z D: int, 2025-05-07T20:33:32.2690242Z scale_ub: Optional[float], 2025-05-07T20:33:32.2690516Z contiguous: bool, 2025-05-07T20:33:32.2690763Z compiled: bool, 2025-05-07T20:33:32.2690982Z ) -> None: 2025-05-07T20:33:32.2691198Z torch.manual_seed(2025) 2025-05-07T20:33:32.2691449Z 2025-05-07T20:33:32.2691722Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.2694031Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
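The "Tried to allocate" sizes reported above correspond exactly to the single bfloat16 tensor of shape [T, 2 * D] created by torch.randn at the top of the test, at 2 bytes per element. A standalone arithmetic check (a sketch, not part of the test suite):

def alloc_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
    # Size of one [T, 2*D] bf16 tensor, in MiB, as the allocator reports it.
    return T * 2 * D * bytes_per_elem / 2**20

print(alloc_mib(2048, 7168))   # 56.0  -> "Tried to allocate 56.00 MiB"
print(alloc_mib(2048, 5120))   # 40.0  -> "Tried to allocate 40.00 MiB"
print(alloc_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"

No single request is large; the failures happen because the process already holds roughly 22 GiB of the card's 22.07 GiB.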
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.2696100Z 2025-05-07T20:33:32.2696218Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.2696444Z 2025-05-07T20:33:32.2696550Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.2696978Z self=, 2025-05-07T20:33:32.2697391Z T=4096, 2025-05-07T20:33:32.2697576Z D=5120, 2025-05-07T20:33:32.2697830Z scale_ub=None, 2025-05-07T20:33:32.2698037Z contiguous=True, 2025-05-07T20:33:32.2698258Z compiled=False, 2025-05-07T20:33:32.2698467Z ) 2025-05-07T20:33:32.3733131Z self = 2025-05-07T20:33:32.3733695Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.3733993Z 2025-05-07T20:33:32.3734076Z @given( 2025-05-07T20:33:32.3734314Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3734634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3734950Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3735292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3735784Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3736085Z ) 2025-05-07T20:33:32.3736453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3736921Z def test_silu_mul_quant( 2025-05-07T20:33:32.3737170Z self, 2025-05-07T20:33:32.3737366Z T: int, 2025-05-07T20:33:32.3737565Z D: int, 2025-05-07T20:33:32.3737781Z scale_ub: Optional[float], 2025-05-07T20:33:32.3738146Z contiguous: bool, 2025-05-07T20:33:32.3738399Z compiled: bool, 2025-05-07T20:33:32.3738627Z ) -> None: 2025-05-07T20:33:32.3738847Z torch.manual_seed(2025) 2025-05-07T20:33:32.3739099Z 2025-05-07T20:33:32.3739376Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3741615Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
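Every example from here on fails the same way, because the memory held by earlier examples is never returned between Hypothesis runs. The allocator message itself suggests expandable segments for the fragmentation component; a sketch of both standard mitigations, assuming the environment variable is set before CUDA is first initialized and that the previous example's tensors really are unreferenced:

import gc
import os

# Must be set before the first CUDA allocation in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def reset_cuda_between_examples() -> None:
    gc.collect()                          # drop tensors kept alive only by traceback references
    torch.cuda.empty_cache()              # return cached, unused blocks to the driver
    torch.cuda.reset_peak_memory_stats()  # keep per-example peak readings meaningful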
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3743677Z 2025-05-07T20:33:32.3743794Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3744013Z 2025-05-07T20:33:32.3744142Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3744568Z self=, 2025-05-07T20:33:32.3744991Z T=2048, 2025-05-07T20:33:32.3745168Z D=5120, 2025-05-07T20:33:32.3745356Z scale_ub=None, 2025-05-07T20:33:32.3745567Z contiguous=False, 2025-05-07T20:33:32.3745785Z compiled=False, 2025-05-07T20:33:32.3745988Z ) 2025-05-07T20:33:32.3746313Z self = 2025-05-07T20:33:32.3746830Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:32.3747119Z 2025-05-07T20:33:32.3747194Z @given( 2025-05-07T20:33:32.3747421Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3747741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3748046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3748387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3748725Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3749012Z ) 2025-05-07T20:33:32.3749482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3750037Z def test_silu_mul_quant( 2025-05-07T20:33:32.3750285Z self, 2025-05-07T20:33:32.3750473Z T: int, 2025-05-07T20:33:32.3750670Z D: int, 2025-05-07T20:33:32.3750890Z scale_ub: Optional[float], 2025-05-07T20:33:32.3751159Z contiguous: bool, 2025-05-07T20:33:32.3751405Z compiled: bool, 2025-05-07T20:33:32.3751626Z ) -> None: 2025-05-07T20:33:32.3751837Z torch.manual_seed(2025) 2025-05-07T20:33:32.3752081Z 2025-05-07T20:33:32.3752450Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3754684Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3756732Z 2025-05-07T20:33:32.3756848Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3757118Z 2025-05-07T20:33:32.3757221Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3757647Z self=, 2025-05-07T20:33:32.3758068Z T=4096, 2025-05-07T20:33:32.3758250Z D=7168, 2025-05-07T20:33:32.3758437Z scale_ub=None, 2025-05-07T20:33:32.3758649Z contiguous=True, 2025-05-07T20:33:32.3758864Z compiled=True, 2025-05-07T20:33:32.3759114Z ) 2025-05-07T20:33:32.3759443Z self = 2025-05-07T20:33:32.3759949Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:32.3760238Z 2025-05-07T20:33:32.3760313Z @given( 2025-05-07T20:33:32.3760539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3760863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3761170Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3761508Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3761846Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3762135Z ) 2025-05-07T20:33:32.3762492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3762956Z def test_silu_mul_quant( 2025-05-07T20:33:32.3763191Z self, 2025-05-07T20:33:32.3763383Z T: int, 2025-05-07T20:33:32.3763579Z D: int, 2025-05-07T20:33:32.3763795Z scale_ub: Optional[float], 2025-05-07T20:33:32.3764069Z contiguous: bool, 2025-05-07T20:33:32.3764308Z compiled: bool, 2025-05-07T20:33:32.3764524Z ) -> None: 2025-05-07T20:33:32.3764740Z torch.manual_seed(2025) 2025-05-07T20:33:32.3764985Z 2025-05-07T20:33:32.3765259Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3767500Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3769567Z 2025-05-07T20:33:32.3769682Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3769903Z 2025-05-07T20:33:32.3770003Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3770474Z self=, 2025-05-07T20:33:32.3772216Z T=2048, 2025-05-07T20:33:32.3772397Z D=5120, 2025-05-07T20:33:32.3772587Z scale_ub=1200.0, 2025-05-07T20:33:32.3772803Z contiguous=False, 2025-05-07T20:33:32.3773028Z compiled=False, 2025-05-07T20:33:32.3773228Z ) 2025-05-07T20:33:32.3773558Z self = 2025-05-07T20:33:32.3774089Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:32.3774446Z 2025-05-07T20:33:32.3774525Z @given( 2025-05-07T20:33:32.3774761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3783639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3784098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3784446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3784785Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3785074Z ) 2025-05-07T20:33:32.3785437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3785910Z def test_silu_mul_quant( 2025-05-07T20:33:32.3786156Z self, 2025-05-07T20:33:32.3786347Z T: int, 2025-05-07T20:33:32.3786532Z D: int, 2025-05-07T20:33:32.3786875Z scale_ub: Optional[float], 2025-05-07T20:33:32.3787150Z contiguous: bool, 2025-05-07T20:33:32.3787392Z compiled: bool, 2025-05-07T20:33:32.3787622Z ) -> None: 2025-05-07T20:33:32.3787838Z torch.manual_seed(2025) 2025-05-07T20:33:32.3788087Z 2025-05-07T20:33:32.3788372Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3790766Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3792904Z 2025-05-07T20:33:32.3793029Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3793258Z 2025-05-07T20:33:32.3793363Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3793798Z self=, 2025-05-07T20:33:32.3794224Z T=4096, 2025-05-07T20:33:32.3794410Z D=7168, 2025-05-07T20:33:32.3794605Z scale_ub=1200.0, 2025-05-07T20:33:32.3794832Z contiguous=True, 2025-05-07T20:33:32.3795055Z compiled=False, 2025-05-07T20:33:32.3795265Z ) 2025-05-07T20:33:32.3795593Z self = 2025-05-07T20:33:32.3796111Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.3796405Z 2025-05-07T20:33:32.3796481Z @given( 2025-05-07T20:33:32.3796708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.3797023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.3797343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.3797683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.3798019Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.3798309Z ) 2025-05-07T20:33:32.3798668Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.3799134Z def test_silu_mul_quant( 2025-05-07T20:33:32.3799373Z self, 2025-05-07T20:33:32.3799573Z T: int, 2025-05-07T20:33:32.3799766Z D: int, 2025-05-07T20:33:32.3799977Z scale_ub: Optional[float], 2025-05-07T20:33:32.3800254Z contiguous: bool, 2025-05-07T20:33:32.3800569Z compiled: bool, 2025-05-07T20:33:32.3800785Z ) -> None: 2025-05-07T20:33:32.3801001Z torch.manual_seed(2025) 2025-05-07T20:33:32.3801242Z 2025-05-07T20:33:32.3801511Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.3803762Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.3806446Z 2025-05-07T20:33:32.3806563Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.3806786Z 2025-05-07T20:33:32.3806892Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.3807317Z self=, 2025-05-07T20:33:32.3807731Z T=16384, 2025-05-07T20:33:32.3807925Z D=7168, 2025-05-07T20:33:32.3808115Z scale_ub=None, 2025-05-07T20:33:32.3808320Z contiguous=False, 2025-05-07T20:33:32.3808593Z compiled=True, 2025-05-07T20:33:32.3808798Z ) 2025-05-07T20:33:32.5106377Z self = 2025-05-07T20:33:32.5106947Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:32.5107272Z 2025-05-07T20:33:32.5107354Z @given( 2025-05-07T20:33:32.5107595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.5108132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.5108453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.5108801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.5109151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.5109448Z ) 2025-05-07T20:33:32.5109980Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.5110457Z def test_silu_mul_quant( 2025-05-07T20:33:32.5110703Z self, 2025-05-07T20:33:32.5110903Z T: int, 2025-05-07T20:33:32.5111114Z D: int, 2025-05-07T20:33:32.5111333Z scale_ub: Optional[float], 2025-05-07T20:33:32.5111619Z contiguous: bool, 2025-05-07T20:33:32.5111865Z compiled: bool, 2025-05-07T20:33:32.5112091Z ) -> None: 2025-05-07T20:33:32.5112305Z torch.manual_seed(2025) 2025-05-07T20:33:32.5112553Z 2025-05-07T20:33:32.5112821Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.5115094Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.5117181Z 2025-05-07T20:33:32.5117297Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.5117524Z 2025-05-07T20:33:32.5117627Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.5118054Z self=, 2025-05-07T20:33:32.5118467Z T=4096, 2025-05-07T20:33:32.5118654Z D=7168, 2025-05-07T20:33:32.5118848Z scale_ub=None, 2025-05-07T20:33:32.5119055Z contiguous=True, 2025-05-07T20:33:32.5119277Z compiled=False, 2025-05-07T20:33:32.5119485Z ) 2025-05-07T20:33:32.5119892Z self = 2025-05-07T20:33:32.5120413Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.5120706Z 2025-05-07T20:33:32.5120783Z @given( 2025-05-07T20:33:32.5121010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.5121329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.5121642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.5121976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.5122414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.5122729Z ) 2025-05-07T20:33:32.5123091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.5123555Z def test_silu_mul_quant( 2025-05-07T20:33:32.5123798Z self, 2025-05-07T20:33:32.5123988Z T: int, 2025-05-07T20:33:32.5124184Z D: int, 2025-05-07T20:33:32.5124396Z scale_ub: Optional[float], 2025-05-07T20:33:32.5124674Z contiguous: bool, 2025-05-07T20:33:32.5124919Z compiled: bool, 2025-05-07T20:33:32.5125144Z ) -> None: 2025-05-07T20:33:32.5125356Z torch.manual_seed(2025) 2025-05-07T20:33:32.5125608Z 2025-05-07T20:33:32.5125882Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.5128201Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.5130307Z 2025-05-07T20:33:32.5130432Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.5130659Z 2025-05-07T20:33:32.5130761Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.5131188Z self=, 2025-05-07T20:33:32.5131614Z T=16384, 2025-05-07T20:33:32.5131801Z D=7168, 2025-05-07T20:33:32.5131996Z scale_ub=None, 2025-05-07T20:33:32.5132210Z contiguous=True, 2025-05-07T20:33:32.5132430Z compiled=False, 2025-05-07T20:33:32.5132636Z ) 2025-05-07T20:33:32.5132959Z self = 2025-05-07T20:33:32.5133475Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:32.5133774Z 2025-05-07T20:33:32.5133851Z @given( 2025-05-07T20:33:32.5134081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.5134396Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.5134708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.5135045Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.5135381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.5135670Z ) 2025-05-07T20:33:32.5136031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.5136497Z def test_silu_mul_quant( 2025-05-07T20:33:32.5136743Z self, 2025-05-07T20:33:32.5136940Z T: int, 2025-05-07T20:33:32.5137134Z D: int, 2025-05-07T20:33:32.5137345Z scale_ub: Optional[float], 2025-05-07T20:33:32.5137626Z contiguous: bool, 2025-05-07T20:33:32.5137870Z compiled: bool, 2025-05-07T20:33:32.5138088Z ) -> None: 2025-05-07T20:33:32.5138303Z torch.manual_seed(2025) 2025-05-07T20:33:32.5138548Z 2025-05-07T20:33:32.5138847Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.5141171Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.5143239Z 2025-05-07T20:33:32.5143355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.5143618Z 2025-05-07T20:33:32.5143722Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.5144146Z self=, 2025-05-07T20:33:32.5144564Z T=16384, 2025-05-07T20:33:32.5144756Z D=7168, 2025-05-07T20:33:32.5144948Z scale_ub=1200.0, 2025-05-07T20:33:32.5145165Z contiguous=True, 2025-05-07T20:33:32.5145388Z compiled=False, 2025-05-07T20:33:32.5145597Z ) 2025-05-07T20:33:32.5145915Z self = 2025-05-07T20:33:32.5146440Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.5146741Z 2025-05-07T20:33:32.5146822Z @given( 2025-05-07T20:33:32.5147095Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.5147413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.5147730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.5148079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.5148414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.5148717Z ) 2025-05-07T20:33:32.5149080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.5149586Z def test_silu_mul_quant( 2025-05-07T20:33:32.5149936Z self, 2025-05-07T20:33:32.5150130Z T: int, 2025-05-07T20:33:32.5150342Z D: int, 2025-05-07T20:33:32.5150560Z scale_ub: Optional[float], 2025-05-07T20:33:32.5150837Z contiguous: bool, 2025-05-07T20:33:32.5151086Z compiled: bool, 2025-05-07T20:33:32.5151306Z ) -> None: 2025-05-07T20:33:32.5151525Z torch.manual_seed(2025) 2025-05-07T20:33:32.5151771Z 2025-05-07T20:33:32.5152043Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.5154288Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.5156355Z 2025-05-07T20:33:32.5156474Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.5156702Z 2025-05-07T20:33:32.5156803Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.5157229Z self=, 2025-05-07T20:33:32.5157644Z T=128, 2025-05-07T20:33:32.5157836Z D=5120, 2025-05-07T20:33:32.5158028Z scale_ub=1200.0, 2025-05-07T20:33:32.5158248Z contiguous=False, 2025-05-07T20:33:32.5158481Z compiled=False, 2025-05-07T20:33:32.5158688Z ) 2025-05-07T20:33:32.6789679Z self = 2025-05-07T20:33:32.6790523Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:32.6790895Z 2025-05-07T20:33:32.6790975Z @given( 2025-05-07T20:33:32.6791206Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.6791529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.6792181Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.6792526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.6792865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.6793154Z ) 2025-05-07T20:33:32.6793525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.6793992Z def test_silu_mul_quant( 2025-05-07T20:33:32.6794232Z self, 2025-05-07T20:33:32.6794425Z T: int, 2025-05-07T20:33:32.6794621Z D: int, 2025-05-07T20:33:32.6794922Z scale_ub: Optional[float], 2025-05-07T20:33:32.6795191Z contiguous: bool, 2025-05-07T20:33:32.6795440Z compiled: bool, 2025-05-07T20:33:32.6795667Z ) -> None: 2025-05-07T20:33:32.6795882Z torch.manual_seed(2025) 2025-05-07T20:33:32.6796130Z 2025-05-07T20:33:32.6796403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.6796752Z 2025-05-07T20:33:32.6796949Z x_sign = torch.sign(x) 2025-05-07T20:33:32.6797249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.6797569Z x = x_sign * x_clamp 2025-05-07T20:33:32.6797819Z x0 = x[:, :D] 2025-05-07T20:33:32.6798039Z x1 = x[:, D:] 2025-05-07T20:33:32.6798242Z 2025-05-07T20:33:32.6798522Z if contiguous: 2025-05-07T20:33:32.6798765Z x0 = x0.contiguous() 2025-05-07T20:33:32.6799026Z x1 = x1.contiguous() 2025-05-07T20:33:32.6799272Z 2025-05-07T20:33:32.6799470Z if scale_ub is not None: 2025-05-07T20:33:32.6799745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.6800098Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.6800420Z ) 2025-05-07T20:33:32.6800696Z else: 2025-05-07T20:33:32.6800903Z scale_ub_tensor = None 2025-05-07T20:33:32.6801162Z 2025-05-07T20:33:32.6801396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.6801721Z op = silu_mul_quant 2025-05-07T20:33:32.6801980Z if compiled: 2025-05-07T20:33:32.6802233Z op = torch.compile(op) 2025-05-07T20:33:32.6802536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.6802822Z 2025-05-07T20:33:32.6803023Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.6803192Z 2025-05-07T20:33:32.6803288Z moe/activation_test.py:117: 2025-05-07T20:33:32.6803591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.6803940Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.6804227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.6804966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.6805715Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.6806289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.6807017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.6807725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.6808294Z kernel = self.compile( 2025-05-07T20:33:32.6808905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.6809621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.6810039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.6810280Z 2025-05-07T20:33:32.6810500Z self = 2025-05-07T20:33:32.6811727Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.6813241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd00bca0>} 2025-05-07T20:33:32.6814715Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.6815868Z context = 2025-05-07T20:33:32.6816173Z 2025-05-07T20:33:32.6816348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.6816891Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.6817386Z module_map=module_map) 2025-05-07T20:33:32.6817769Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.6818130Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.6818392Z E ^ 2025-05-07T20:33:32.6818885Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.6819373Z 2025-05-07T20:33:32.6819869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.6820427Z 2025-05-07T20:33:32.6820529Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.6820961Z self=, 2025-05-07T20:33:32.6821385Z T=2048, 2025-05-07T20:33:32.6821574Z D=7168, 2025-05-07T20:33:32.6821759Z scale_ub=None, 2025-05-07T20:33:32.6822023Z contiguous=False, 2025-05-07T20:33:32.6822257Z compiled=False, 2025-05-07T20:33:32.6822463Z ) 2025-05-07T20:33:32.6822788Z self = 2025-05-07T20:33:32.6823312Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:32.6823601Z 2025-05-07T20:33:32.6823679Z @given( 2025-05-07T20:33:32.6823908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.6824231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.6824545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.6824888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.6825234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.6825534Z ) 2025-05-07T20:33:32.6825890Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.6826355Z def test_silu_mul_quant( 2025-05-07T20:33:32.6826602Z self, 2025-05-07T20:33:32.6826795Z T: int, 2025-05-07T20:33:32.6826996Z D: int, 2025-05-07T20:33:32.6827213Z scale_ub: Optional[float], 2025-05-07T20:33:32.6827480Z contiguous: bool, 2025-05-07T20:33:32.6827724Z compiled: bool, 2025-05-07T20:33:32.6827952Z ) -> None: 2025-05-07T20:33:32.6828160Z torch.manual_seed(2025) 2025-05-07T20:33:32.6828406Z 2025-05-07T20:33:32.6828679Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.6831043Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.6833104Z 2025-05-07T20:33:32.6833227Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.6833445Z 2025-05-07T20:33:32.6833597Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.6834026Z self=, 2025-05-07T20:33:32.6834447Z T=128, 2025-05-07T20:33:32.6834626Z D=7168, 2025-05-07T20:33:32.6834818Z scale_ub=1200.0, 2025-05-07T20:33:32.6835044Z contiguous=True, 2025-05-07T20:33:32.6835263Z compiled=True, 2025-05-07T20:33:32.6835467Z ) 2025-05-07T20:33:32.7286323Z self = 2025-05-07T20:33:32.7287076Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:32.7287369Z 2025-05-07T20:33:32.7287460Z @given( 2025-05-07T20:33:32.7287703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.7288030Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.7288350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.7288702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.7289044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.7289348Z ) 2025-05-07T20:33:32.7289720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.7290188Z def test_silu_mul_quant( 2025-05-07T20:33:32.7290445Z self, 2025-05-07T20:33:32.7290723Z T: int, 2025-05-07T20:33:32.7290919Z D: int, 2025-05-07T20:33:32.7291141Z scale_ub: Optional[float], 2025-05-07T20:33:32.7291420Z contiguous: bool, 2025-05-07T20:33:32.7291661Z compiled: bool, 2025-05-07T20:33:32.7291887Z ) -> None: 2025-05-07T20:33:32.7292105Z torch.manual_seed(2025) 2025-05-07T20:33:32.7292354Z 2025-05-07T20:33:32.7292625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.7293061Z 2025-05-07T20:33:32.7293255Z x_sign = torch.sign(x) 2025-05-07T20:33:32.7293545Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.7293872Z x = x_sign * x_clamp 2025-05-07T20:33:32.7294121Z x0 = x[:, :D] 2025-05-07T20:33:32.7294333Z x1 = x[:, D:] 2025-05-07T20:33:32.7294541Z 2025-05-07T20:33:32.7294727Z if contiguous: 2025-05-07T20:33:32.7294955Z x0 = x0.contiguous() 2025-05-07T20:33:32.7295228Z x1 = x1.contiguous() 2025-05-07T20:33:32.7295473Z 2025-05-07T20:33:32.7295660Z if scale_ub is not None: 2025-05-07T20:33:32.7295939Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.7296283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.7296603Z ) 2025-05-07T20:33:32.7296798Z else: 2025-05-07T20:33:32.7297010Z scale_ub_tensor = None 2025-05-07T20:33:32.7297267Z 2025-05-07T20:33:32.7297497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.7297821Z op = silu_mul_quant 2025-05-07T20:33:32.7298077Z if compiled: 2025-05-07T20:33:32.7298330Z op = torch.compile(op) 2025-05-07T20:33:32.7298634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.7298922Z 2025-05-07T20:33:32.7299106Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.7299280Z 2025-05-07T20:33:32.7299377Z moe/activation_test.py:117: 2025-05-07T20:33:32.7299681Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.7300017Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.7300304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.7300895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:32.7301494Z return fn(*args, **kwargs) 2025-05-07T20:33:32.7302193Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.7302938Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.7303590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.7304319Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.7305032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.7305610Z kernel = self.compile( 2025-05-07T20:33:32.7306186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.7306926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.7307346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.7307587Z 2025-05-07T20:33:32.7307812Z self = 2025-05-07T20:33:32.7309027Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.7310707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fcf3c0d0>} 2025-05-07T20:33:32.7312234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.7313348Z context = 2025-05-07T20:33:32.7313650Z 2025-05-07T20:33:32.7313825Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.7314444Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.7314940Z module_map=module_map) 2025-05-07T20:33:32.7323986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.7324387Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.7324661Z E ^ 2025-05-07T20:33:32.7325174Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.7325677Z 2025-05-07T20:33:32.7326137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.7326699Z 2025-05-07T20:33:32.7326820Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.7327250Z self=, 2025-05-07T20:33:32.7327686Z T=128, 2025-05-07T20:33:32.7327886Z D=7168, 2025-05-07T20:33:32.7328084Z scale_ub=1200.0, 2025-05-07T20:33:32.7328316Z contiguous=True, 2025-05-07T20:33:32.7328550Z compiled=False, 2025-05-07T20:33:32.7328765Z ) 2025-05-07T20:33:32.7329095Z self = 2025-05-07T20:33:32.7329627Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.7329919Z 2025-05-07T20:33:32.7330009Z @given( 2025-05-07T20:33:32.7330241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.7330575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.7330901Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.7331247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.7331603Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.7331911Z ) 2025-05-07T20:33:32.7332282Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.7332760Z def test_silu_mul_quant( 2025-05-07T20:33:32.7333016Z self, 2025-05-07T20:33:32.7333213Z T: int, 2025-05-07T20:33:32.7333418Z D: int, 2025-05-07T20:33:32.7333639Z scale_ub: Optional[float], 2025-05-07T20:33:32.7334029Z contiguous: bool, 2025-05-07T20:33:32.7334283Z compiled: bool, 2025-05-07T20:33:32.7334507Z ) -> None: 2025-05-07T20:33:32.7334728Z torch.manual_seed(2025) 2025-05-07T20:33:32.7334982Z 2025-05-07T20:33:32.7335258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.7335620Z 2025-05-07T20:33:32.7335820Z x_sign = torch.sign(x) 2025-05-07T20:33:32.7336112Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.7338380Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.7340494Z 2025-05-07T20:33:32.7340612Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:32.7340840Z 2025-05-07T20:33:32.7340943Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.7341415Z self=, 2025-05-07T20:33:32.7341838Z T=128, 2025-05-07T20:33:32.7342029Z D=5120, 2025-05-07T20:33:32.7342223Z scale_ub=1200.0, 2025-05-07T20:33:32.7342448Z contiguous=True, 2025-05-07T20:33:32.7342675Z compiled=True, 2025-05-07T20:33:32.7342884Z ) 2025-05-07T20:33:32.7343207Z self = 2025-05-07T20:33:32.7343781Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:32.7344071Z 2025-05-07T20:33:32.7344151Z @given( 2025-05-07T20:33:32.7344388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.7344705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.7345021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.7345366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.7345703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.7346002Z ) 2025-05-07T20:33:32.7346366Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.7346828Z def test_silu_mul_quant( 2025-05-07T20:33:32.7347081Z self, 2025-05-07T20:33:32.7347276Z T: int, 2025-05-07T20:33:32.7347471Z D: int, 2025-05-07T20:33:32.7347693Z scale_ub: Optional[float], 2025-05-07T20:33:32.7347977Z contiguous: bool, 2025-05-07T20:33:32.7348222Z compiled: bool, 2025-05-07T20:33:32.7348451Z ) -> None: 2025-05-07T20:33:32.7348670Z torch.manual_seed(2025) 2025-05-07T20:33:32.7348922Z 2025-05-07T20:33:32.7349197Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.7349550Z 2025-05-07T20:33:32.7349748Z x_sign = torch.sign(x) 2025-05-07T20:33:32.7350141Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.7352339Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.7354397Z 2025-05-07T20:33:32.7354515Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:32.7354737Z 2025-05-07T20:33:32.7354902Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.7355335Z self=, 2025-05-07T20:33:32.7355757Z T=128, 2025-05-07T20:33:32.7355948Z D=7168, 2025-05-07T20:33:32.7356142Z scale_ub=None, 2025-05-07T20:33:32.7356353Z contiguous=True, 2025-05-07T20:33:32.7356576Z compiled=True, 2025-05-07T20:33:32.7356785Z ) 2025-05-07T20:33:32.9426705Z self = 2025-05-07T20:33:32.9427269Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:32.9427813Z 2025-05-07T20:33:32.9427896Z @given( 2025-05-07T20:33:32.9428140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9428461Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9428787Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9429130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9429478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9429912Z ) 2025-05-07T20:33:32.9430279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9430748Z def test_silu_mul_quant( 2025-05-07T20:33:32.9430986Z self, 2025-05-07T20:33:32.9431185Z T: int, 2025-05-07T20:33:32.9431474Z D: int, 2025-05-07T20:33:32.9431691Z scale_ub: Optional[float], 2025-05-07T20:33:32.9431973Z contiguous: bool, 2025-05-07T20:33:32.9432217Z compiled: bool, 2025-05-07T20:33:32.9432452Z ) -> None: 2025-05-07T20:33:32.9432672Z torch.manual_seed(2025) 2025-05-07T20:33:32.9432912Z 2025-05-07T20:33:32.9433179Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9435531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.9437612Z 2025-05-07T20:33:32.9437729Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:32.9437954Z 2025-05-07T20:33:32.9448178Z FAILED 2025-05-07T20:33:32.9448402Z 2025-05-07T20:33:32.9448640Z =================================== FAILURES =================================== 2025-05-07T20:33:32.9449329Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:32.9449981Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:32.9450882Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:33:32.9451692Z | yield 2025-05-07T20:33:32.9452219Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:33:32.9452779Z | self._callTestMethod(testMethod) 2025-05-07T20:33:32.9453438Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:33:32.9454310Z | method() 2025-05-07T20:33:32.9455253Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:32.9456345Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9457277Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:32.9458210Z | raise the_error_hypothesis_found 2025-05-07T20:33:32.9459080Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:32.9459802Z +-+---------------- 1 ---------------- 2025-05-07T20:33:32.9460199Z | Traceback (most recent call last): 2025-05-07T20:33:32.9461216Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:32.9462341Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9465346Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.9468317Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:32.9468933Z | self=, 2025-05-07T20:33:32.9469506Z | T=2048, 2025-05-07T20:33:32.9469981Z | D=5120, # or any other generated value 2025-05-07T20:33:32.9470528Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:32.9471050Z | contiguous=True, # or any other generated value 2025-05-07T20:33:32.9471551Z | compiled=False, # or any other generated value 2025-05-07T20:33:32.9471980Z | ) 2025-05-07T20:33:32.9472218Z | 2025-05-07T20:33:32.9472951Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:32.9473884Z +---------------- 2 ---------------- 2025-05-07T20:33:32.9474290Z | Traceback (most recent call last): 2025-05-07T20:33:32.9475308Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:32.9476423Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9479449Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.9482351Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:32.9483209Z | self=, 2025-05-07T20:33:32.9483643Z | T=128, 2025-05-07T20:33:32.9483843Z | D=7168, 2025-05-07T20:33:32.9484055Z | scale_ub=None, 2025-05-07T20:33:32.9484303Z | contiguous=True, 2025-05-07T20:33:32.9484545Z | compiled=True, 2025-05-07T20:33:32.9484769Z | ) 2025-05-07T20:33:32.9485820Z | 2025-05-07T20:33:32.9486438Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:32.9487084Z +---------------- 3 ---------------- 2025-05-07T20:33:32.9487385Z | Traceback (most recent call last): 2025-05-07T20:33:32.9488143Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:32.9488979Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9491318Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
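Each falsifying example above ends with a version-pinned reproduction blob. A sketch of how the first one would be replayed locally (the decorator is temporary, is only valid for this exact Hypothesis version, 6.131.14, and the @settings arguments are trimmed relative to the listing above):

import unittest

from hypothesis import given, reproduce_failure, settings, strategies as st

class ActivationTests(unittest.TestCase):
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob from failure 1 above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # body unchanged from the listing above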
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:32.9493537Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:32.9493992Z | self=, 2025-05-07T20:33:32.9494423Z | T=128, 2025-05-07T20:33:32.9494624Z | D=5120, 2025-05-07T20:33:32.9494841Z | scale_ub=1200.0, 2025-05-07T20:33:32.9495091Z | contiguous=True, 2025-05-07T20:33:32.9495331Z | compiled=True, 2025-05-07T20:33:32.9495647Z | ) 2025-05-07T20:33:32.9495897Z | 2025-05-07T20:33:32.9496664Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:32.9497535Z +---------------- 4 ---------------- 2025-05-07T20:33:32.9497943Z | Traceback (most recent call last): 2025-05-07T20:33:32.9499147Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:32.9500218Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:32.9501202Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:32.9502328Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9503598Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:32.9504798Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:32.9505702Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:32.9506782Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.9507867Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:32.9509026Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9510349Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:32.9511561Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9512733Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:32.9513775Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:32.9514746Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:32.9515590Z | fn() 2025-05-07T20:33:32.9516425Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:32.9517380Z | self.fn.run( 2025-05-07T20:33:32.9518158Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:32.9519069Z | kernel = self.compile( 2025-05-07T20:33:32.9520002Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:32.9521029Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.9522090Z | File 
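Three of the four distinct failures above are allocator OOMs raised while building the [T, 2 * D] bf16 input, with PyTorch already holding 21.77 GiB of the 22.07 GiB device. The error text itself points at PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; below is a minimal sketch, in that spirit, of re-running the test with the suggested setting plus an explicit cache flush between examples. The setUp hook and the pytest invocation in the comment are illustrative assumptions, not part of the existing test file.

    # Sketch only: allocator hygiene for re-running this test locally. The env
    # var is the one suggested by the OutOfMemoryError above; it must be set
    # before the first CUDA allocation in the process, e.g.
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    #       python -m pytest moe/activation_test.py -k test_silu_mul_quant
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import unittest

    import torch


    class ActivationTestsWithCleanup(unittest.TestCase):  # hypothetical name
        def setUp(self) -> None:
            # Drop cached blocks left over from earlier Hypothesis examples so
            # each generated (T, D) shape starts from a cleaner allocator state.
            if torch.cuda.is_available():
                torch.cuda.synchronize()
                torch.cuda.empty_cache()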
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:32.9523281Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.9524046Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.9524558Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:32.9524992Z | ^ 2025-05-07T20:33:32.9525666Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9526523Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:32.9527102Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:32.9527849Z | self=, 2025-05-07T20:33:32.9528489Z | T=1, # or any other generated value 2025-05-07T20:33:32.9528942Z | D=5120, # or any other generated value 2025-05-07T20:33:32.9529423Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:32.9529933Z | contiguous=True, # or any other generated value 2025-05-07T20:33:32.9530516Z | compiled=True, # or any other generated value 2025-05-07T20:33:32.9530949Z | ) 2025-05-07T20:33:32.9531182Z | 2025-05-07T20:33:32.9531925Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:32.9532803Z +------------------------------------ 2025-05-07T20:33:32.9533298Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:32.9533907Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9534512Z self=, 2025-05-07T20:33:32.9535106Z T=1, 2025-05-07T20:33:32.9535363Z D=5120, 2025-05-07T20:33:32.9535635Z scale_ub=None, 2025-05-07T20:33:32.9535942Z contiguous=True, 2025-05-07T20:33:32.9536251Z compiled=True, 2025-05-07T20:33:32.9536543Z ) 2025-05-07T20:33:32.9536997Z self = 2025-05-07T20:33:32.9537697Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:32.9538072Z 2025-05-07T20:33:32.9538181Z @given( 2025-05-07T20:33:32.9538487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9538951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9539362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9539820Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9540265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9540646Z ) 2025-05-07T20:33:32.9541131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9541764Z def test_silu_mul_quant( 2025-05-07T20:33:32.9542099Z self, 2025-05-07T20:33:32.9542374Z T: int, 2025-05-07T20:33:32.9542647Z D: int, 2025-05-07T20:33:32.9542947Z scale_ub: Optional[float], 2025-05-07T20:33:32.9543335Z contiguous: bool, 2025-05-07T20:33:32.9543665Z compiled: bool, 2025-05-07T20:33:32.9543961Z ) -> None: 2025-05-07T20:33:32.9544255Z torch.manual_seed(2025) 2025-05-07T20:33:32.9544607Z 2025-05-07T20:33:32.9544987Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9545490Z 2025-05-07T20:33:32.9545752Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9546159Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9546591Z x = x_sign * x_clamp 2025-05-07T20:33:32.9546923Z x0 = x[:, :D] 2025-05-07T20:33:32.9547187Z x1 = x[:, D:] 2025-05-07T20:33:32.9547554Z 2025-05-07T20:33:32.9547785Z if contiguous: 2025-05-07T20:33:32.9548100Z x0 = x0.contiguous() 
2025-05-07T20:33:32.9548457Z x1 = x1.contiguous() 2025-05-07T20:33:32.9548803Z 2025-05-07T20:33:32.9549116Z if scale_ub is not None: 2025-05-07T20:33:32.9549506Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.9550142Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.9550591Z ) 2025-05-07T20:33:32.9550871Z else: 2025-05-07T20:33:32.9551225Z scale_ub_tensor = None 2025-05-07T20:33:32.9551593Z 2025-05-07T20:33:32.9551910Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9552371Z op = silu_mul_quant 2025-05-07T20:33:32.9552734Z if compiled: 2025-05-07T20:33:32.9553089Z op = torch.compile(op) 2025-05-07T20:33:32.9553506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.9553907Z 2025-05-07T20:33:32.9554175Z y_fp8, y_scale = fn() 2025-05-07T20:33:32.9554578Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:32.9555001Z 2025-05-07T20:33:32.9555331Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9555776Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:32.9556245Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:32.9556665Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:32.9557148Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9557582Z 2025-05-07T20:33:32.9557855Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:32.9558127Z 2025-05-07T20:33:32.9558266Z moe/activation_test.py:126: 2025-05-07T20:33:32.9558708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9559155Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:32.9559585Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9560733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:32.9561835Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:32.9562611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.9563593Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.9564565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:32.9565566Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9566654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:32.9567698Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9568740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:32.9569648Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:32.9570527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:32.9571264Z fn() 2025-05-07T20:33:32.9571974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:32.9572802Z self.fn.run( 2025-05-07T20:33:32.9573454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.9574205Z kernel = self.compile( 2025-05-07T20:33:32.9574968Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.9575964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.9576528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9576869Z 2025-05-07T20:33:32.9577149Z self = 2025-05-07T20:33:32.9578691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.9580877Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f59015d99d0>} 2025-05-07T20:33:32.9583126Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.9584644Z context = 2025-05-07T20:33:32.9585075Z 2025-05-07T20:33:32.9585311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.9586084Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.9586935Z module_map=module_map) 2025-05-07T20:33:32.9587470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.9587976Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:32.9588360Z E ^ 2025-05-07T20:33:32.9589037Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9589733Z 2025-05-07T20:33:32.9610252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.9611034Z 2025-05-07T20:33:32.9611177Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9611727Z self=, 2025-05-07T20:33:32.9612264Z T=2048, 2025-05-07T20:33:32.9612501Z D=5120, 2025-05-07T20:33:32.9612741Z scale_ub=1200.0, 2025-05-07T20:33:32.9613017Z contiguous=True, 2025-05-07T20:33:32.9613305Z compiled=False, 2025-05-07T20:33:32.9613578Z ) 2025-05-07T20:33:32.9613991Z self = 2025-05-07T20:33:32.9614670Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:32.9615044Z 2025-05-07T20:33:32.9615145Z @given( 2025-05-07T20:33:32.9615439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9615850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9616265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9616703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9617143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9617534Z ) 2025-05-07T20:33:32.9618002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9618604Z def test_silu_mul_quant( 2025-05-07T20:33:32.9618925Z self, 2025-05-07T20:33:32.9619195Z T: int, 2025-05-07T20:33:32.9619450Z D: int, 2025-05-07T20:33:32.9619738Z scale_ub: Optional[float], 2025-05-07T20:33:32.9620099Z contiguous: bool, 2025-05-07T20:33:32.9620409Z compiled: bool, 2025-05-07T20:33:32.9620712Z ) -> None: 2025-05-07T20:33:32.9621010Z torch.manual_seed(2025) 2025-05-07T20:33:32.9621353Z 2025-05-07T20:33:32.9621729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9622235Z 2025-05-07T20:33:32.9622494Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9622899Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9623352Z x = x_sign * x_clamp 2025-05-07T20:33:32.9623888Z x0 = x[:, :D] 
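This CompilationError is architectural: Triton on this runner's GPU (a g5 instance's A10G, SM 8.6) cannot lower the fp8e4nv (e4m3) dtype and offers only fp8e4b15 and fp8e5. A hedged sketch of a capability guard that could skip the FP8 paths on such devices follows; the helper name and the SM 8.9 (Ada/Hopper) threshold are assumptions, not existing FBGEMM code.

    # Sketch only: skip FP8 e4m3 cases on GPUs without hardware FP8 support.
    # _supports_fp8e4m3 is a hypothetical helper; the (8, 9) threshold assumes
    # Triton's fp8e4nv lowering targets Ada (SM 8.9) and newer.
    import unittest

    import torch


    def _supports_fp8e4m3() -> bool:
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns (major, minor), e.g. (8, 6) on A10G.
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not _supports_fp8e4m3(), "fp8e4nv requires SM >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):
        ...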
[The remaining Hypothesis examples re-print the identical test body and Triton compile chain on every attempt; they are condensed below to parameters, failing call, and error.]

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [test source elided]
>       y_fp8, y_scale = fn()                      moe/activation_test.py:117
    via silu_mul_quant (gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    [test source elided]
>       y_fp8_ref, y_scale_ref = ref_fn()          moe/activation_test.py:126
    via triton_quantize_fp8_row (triton_gemm/fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    [test source elided]
>       y_fp8, y_scale = fn()                      moe/activation_test.py:117
    via silu_mul_quant (gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    [test source elided]
>       y_fp8_ref, y_scale_ref = ref_fn()          moe/activation_test.py:126
    via triton_quantize_fp8_row (triton_gemm/fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [test source elided]
>       y_fp8, y_scale = fn()                      moe/activation_test.py:117
    via silu_mul_quant (gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9859118Z 2025-05-07T20:33:32.9859572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.9860128Z 2025-05-07T20:33:32.9860235Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9860660Z self=, 2025-05-07T20:33:32.9861120Z T=4096, 2025-05-07T20:33:32.9861308Z D=7168, 2025-05-07T20:33:32.9861495Z scale_ub=None, 2025-05-07T20:33:32.9861707Z contiguous=False, 2025-05-07T20:33:32.9861927Z compiled=False, 2025-05-07T20:33:32.9862135Z ) 2025-05-07T20:33:32.9862456Z self = 2025-05-07T20:33:32.9862973Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:32.9863310Z 2025-05-07T20:33:32.9863388Z @given( 2025-05-07T20:33:32.9863614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9863927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9864244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9864581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9864914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9865210Z ) 2025-05-07T20:33:32.9865569Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9866031Z def test_silu_mul_quant( 2025-05-07T20:33:32.9866271Z self, 2025-05-07T20:33:32.9866455Z T: int, 2025-05-07T20:33:32.9866650Z D: int, 2025-05-07T20:33:32.9866860Z scale_ub: Optional[float], 2025-05-07T20:33:32.9867127Z contiguous: bool, 2025-05-07T20:33:32.9867366Z compiled: bool, 2025-05-07T20:33:32.9867587Z ) -> None: 2025-05-07T20:33:32.9867799Z torch.manual_seed(2025) 2025-05-07T20:33:32.9868039Z 2025-05-07T20:33:32.9868310Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9868667Z 2025-05-07T20:33:32.9868864Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9869155Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9869473Z x = x_sign * x_clamp 2025-05-07T20:33:32.9869711Z x0 = x[:, :D] 2025-05-07T20:33:32.9870037Z x1 = x[:, D:] 2025-05-07T20:33:32.9870252Z 2025-05-07T20:33:32.9870445Z if contiguous: 2025-05-07T20:33:32.9870676Z x0 = x0.contiguous() 2025-05-07T20:33:32.9870944Z x1 = x1.contiguous() 2025-05-07T20:33:32.9871191Z 2025-05-07T20:33:32.9871388Z if scale_ub is not None: 2025-05-07T20:33:32.9871663Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.9872011Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.9872340Z ) 2025-05-07T20:33:32.9872529Z else: 2025-05-07T20:33:32.9872742Z scale_ub_tensor = None 2025-05-07T20:33:32.9873006Z 2025-05-07T20:33:32.9873288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9873618Z op = silu_mul_quant 2025-05-07T20:33:32.9873882Z if compiled: 2025-05-07T20:33:32.9874129Z op = torch.compile(op) 2025-05-07T20:33:32.9874437Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.9874729Z 2025-05-07T20:33:32.9874921Z > y_fp8, y_scale = fn() 2025-05-07T20:33:32.9875096Z 2025-05-07T20:33:32.9875196Z moe/activation_test.py:117: 2025-05-07T20:33:32.9875501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9875893Z moe/activation_test.py:115: in fn 2025-05-07T20:33:32.9876175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.9876916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:32.9877668Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:32.9878234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.9879010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.9879731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.9880301Z kernel = self.compile( 2025-05-07T20:33:32.9880910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.9881615Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.9882033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9882275Z 2025-05-07T20:33:32.9882494Z self = 2025-05-07T20:33:32.9884003Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.9885517Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58ffbdedc0>} 2025-05-07T20:33:32.9886994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.9888105Z context = 2025-05-07T20:33:32.9888411Z 2025-05-07T20:33:32.9888580Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.9889159Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.9889682Z module_map=module_map) 2025-05-07T20:33:32.9890065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.9890427Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:32.9890695Z E ^ 2025-05-07T20:33:32.9891191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9891681Z 2025-05-07T20:33:32.9892131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.9892693Z 2025-05-07T20:33:32.9892796Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9893229Z self=, 2025-05-07T20:33:32.9893651Z T=128, 2025-05-07T20:33:32.9893833Z D=7168, 2025-05-07T20:33:32.9894029Z scale_ub=None, 2025-05-07T20:33:32.9894251Z contiguous=False, 2025-05-07T20:33:32.9894476Z compiled=True, 2025-05-07T20:33:32.9894683Z ) 2025-05-07T20:33:32.9895008Z self = 2025-05-07T20:33:32.9895617Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:32.9895915Z 2025-05-07T20:33:32.9895998Z @given( 2025-05-07T20:33:32.9896237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9896570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9896890Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9897241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9897589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9897955Z ) 2025-05-07T20:33:32.9898316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9898786Z def test_silu_mul_quant( 2025-05-07T20:33:32.9899030Z self, 2025-05-07T20:33:32.9899230Z T: int, 2025-05-07T20:33:32.9899430Z D: int, 2025-05-07T20:33:32.9899644Z scale_ub: Optional[float], 2025-05-07T20:33:32.9899923Z contiguous: bool, 2025-05-07T20:33:32.9900177Z compiled: bool, 2025-05-07T20:33:32.9900399Z ) -> None: 2025-05-07T20:33:32.9900618Z torch.manual_seed(2025) 2025-05-07T20:33:32.9900868Z 2025-05-07T20:33:32.9901140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9901500Z 2025-05-07T20:33:32.9901828Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9902130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9902448Z x = x_sign * x_clamp 2025-05-07T20:33:32.9902699Z x0 = x[:, :D] 2025-05-07T20:33:32.9902923Z x1 = x[:, D:] 2025-05-07T20:33:32.9903128Z 2025-05-07T20:33:32.9903320Z if contiguous: 2025-05-07T20:33:32.9903556Z x0 = x0.contiguous() 2025-05-07T20:33:32.9903904Z x1 = x1.contiguous() 2025-05-07T20:33:32.9904160Z 2025-05-07T20:33:32.9904363Z if scale_ub is not None: 2025-05-07T20:33:32.9904643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.9904994Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.9905320Z ) 2025-05-07T20:33:32.9905514Z else: 2025-05-07T20:33:32.9905730Z scale_ub_tensor = None 2025-05-07T20:33:32.9905988Z 2025-05-07T20:33:32.9906219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9906546Z op = silu_mul_quant 2025-05-07T20:33:32.9906804Z if compiled: 2025-05-07T20:33:32.9907055Z op = torch.compile(op) 2025-05-07T20:33:32.9907359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.9907643Z 2025-05-07T20:33:32.9907841Z y_fp8, y_scale = fn() 2025-05-07T20:33:32.9908129Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:32.9908435Z 2025-05-07T20:33:32.9908676Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9909019Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:32.9909326Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:32.9909655Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:32.9910122Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9910449Z 2025-05-07T20:33:32.9910653Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:32.9910857Z 2025-05-07T20:33:32.9910964Z moe/activation_test.py:126: 2025-05-07T20:33:32.9911265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9911614Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:32.9911951Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.9912795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:32.9913623Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:32.9914252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.9914991Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.9915720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:32.9916494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9917302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:32.9918148Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.9918925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:32.9919611Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:32.9920253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:32.9920801Z fn() 2025-05-07T20:33:32.9921339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:32.9921964Z self.fn.run( 2025-05-07T20:33:32.9922504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.9923065Z kernel = self.compile( 2025-05-07T20:33:32.9923641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:32.9924338Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:32.9924747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.9925044Z 2025-05-07T20:33:32.9925255Z self = 2025-05-07T20:33:32.9926428Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:32.9927937Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58ff752160>} 2025-05-07T20:33:32.9929455Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:32.9930561Z context = 2025-05-07T20:33:32.9930870Z 2025-05-07T20:33:32.9931046Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:32.9931607Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:32.9932097Z module_map=module_map) 2025-05-07T20:33:32.9932475Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:32.9932843Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:32.9933113Z E ^ 2025-05-07T20:33:32.9933605Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:32.9934095Z 2025-05-07T20:33:32.9934552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:32.9935112Z 2025-05-07T20:33:32.9935220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.9935644Z self=, 2025-05-07T20:33:32.9936067Z T=128, 2025-05-07T20:33:32.9936258Z D=7168, 2025-05-07T20:33:32.9936447Z scale_ub=None, 2025-05-07T20:33:32.9936665Z contiguous=False, 2025-05-07T20:33:32.9936898Z compiled=False, 2025-05-07T20:33:32.9937100Z ) 2025-05-07T20:33:32.9937475Z self = 2025-05-07T20:33:32.9937992Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:32.9938274Z 2025-05-07T20:33:32.9938355Z @given( 2025-05-07T20:33:32.9938581Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.9938906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.9939263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.9939640Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.9939981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.9940279Z ) 2025-05-07T20:33:32.9940629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.9941096Z def test_silu_mul_quant( 2025-05-07T20:33:32.9941340Z self, 2025-05-07T20:33:32.9941529Z T: int, 2025-05-07T20:33:32.9941732Z D: int, 2025-05-07T20:33:32.9941956Z scale_ub: Optional[float], 2025-05-07T20:33:32.9942224Z contiguous: bool, 2025-05-07T20:33:32.9942467Z compiled: bool, 2025-05-07T20:33:32.9942688Z ) -> None: 2025-05-07T20:33:32.9942902Z torch.manual_seed(2025) 2025-05-07T20:33:32.9943141Z 2025-05-07T20:33:32.9943465Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.9943821Z 2025-05-07T20:33:32.9944007Z x_sign = torch.sign(x) 2025-05-07T20:33:32.9944303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:32.9944621Z x = x_sign * x_clamp 2025-05-07T20:33:32.9944862Z x0 = x[:, :D] 2025-05-07T20:33:32.9945078Z x1 = x[:, D:] 2025-05-07T20:33:32.9945288Z 2025-05-07T20:33:32.9945516Z if contiguous: 2025-05-07T20:33:32.9945746Z x0 = x0.contiguous() 2025-05-07T20:33:32.9946008Z x1 = x1.contiguous() 2025-05-07T20:33:32.9946247Z 2025-05-07T20:33:32.9946440Z if scale_ub is not None: 2025-05-07T20:33:32.9946716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.9947052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.9947373Z ) 2025-05-07T20:33:32.9947565Z else: 2025-05-07T20:33:32.9947777Z scale_ub_tensor = None 2025-05-07T20:33:32.9948035Z 2025-05-07T20:33:32.9948261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.9948587Z op = silu_mul_quant 2025-05-07T20:33:32.9948839Z if compiled: 
2025-05-07T20:33:32.9949088Z                 op = torch.compile(op)
2025-05-07T20:33:32.9949393Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:32.9949673Z 
2025-05-07T20:33:32.9949942Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:32.9950114Z 
2025-05-07T20:33:32.9950218Z moe/activation_test.py:117: 
2025-05-07T20:33:32.9950512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:32.9950861Z moe/activation_test.py:115: in fn
2025-05-07T20:33:32.9951145Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:32.9951888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:32.9952627Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:32.9953194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:32.9953926Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:32.9954628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:32.9955194Z     kernel = self.compile(
2025-05-07T20:33:32.9955767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:32.9956518Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:32.9956927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:32.9957173Z 
2025-05-07T20:33:32.9957384Z self = <triton.compiler.compiler.ASTSource object>
2025-05-07T20:33:32.9958547Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:32.9960093Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f58ff6dd940>}
2025-05-07T20:33:32.9961553Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:32.9962662Z context = <context object>
2025-05-07T20:33:32.9962971Z 
2025-05-07T20:33:32.9963138Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:32.9963687Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:32.9964213Z                            module_map=module_map)
2025-05-07T20:33:32.9964591Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:32.9964954Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:32.9965219Z E   ^
2025-05-07T20:33:32.9965700Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:32.9966191Z 
2025-05-07T20:33:32.9966636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
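Both failing kernels hit the same root cause: Triton only emits the fp8e4nv (e4m3) dtype on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures it supports only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A hedged sketch of a capability guard that would skip these cases up front; the names here are illustrative, not from the FBGEMM test suite:

# Hypothetical guard (not from the FBGEMM repo): skip fp8 rowwise tests on GPUs
# whose compute capability predates Triton's fp8e4nv (e4m3) support.
import pytest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv codegen requires SM 8.9 (Ada) or newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8 = pytest.mark.skipif(
    not supports_fp8e4nv(),
    reason="Triton fp8e4nv needs compute capability >= 8.9",
)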
2025-05-07T20:33:32.9967232Z 
[The remaining Hypothesis examples repeat the test source and the fp8e4nv traceback above verbatim; each is collapsed below to its parameters and failing call.]
2025-05-07T20:33:32.9967340Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [same test body; fails at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError]
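For context on the reference path that keeps failing in _kernel_quantize_fp8_row: triton_quantize_fp8_row computes one scale per row, optionally bounded by scale_ub, and casts each row to fp8. A minimal pure-PyTorch sketch of that rowwise scheme, assuming the dequantization convention the test itself checks (y ≈ y_fp8.to(torch.float32) * y_scale[:, None]); the function name is illustrative:

# Minimal sketch of rowwise fp8 quantization, assuming the same semantics as
# fbgemm's triton_quantize_fp8_row: one dequant scale per row, optional upper
# bound on the row max. Not fbgemm's implementation.
from typing import Optional, Tuple
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row_sketch(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = x.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    scale = row_max.clamp(min=1e-12) / FP8_MAX        # per-row dequant scale
    x_fp8 = (x / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale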
2025-05-07T20:33:33.0004585Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same test body; fn() succeeds, then fails at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row with the same fp8e4nv CompilationError]
2025-05-07T20:33:33.0025718Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
2025-05-07T20:33:33.0042962Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
2025-05-07T20:33:33.0060100Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
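None of these failures needs Hypothesis to reproduce: any one parameter set triggers the same compile error on an unsupported GPU. A standalone repro sketch using the import path shown in the tracebacks above (the parameter choice is one of the failing examples):

# Standalone repro sketch for one example (T=1, D=5120, scale_ub=None); on a
# pre-SM89 GPU this raises the same fp8e4nv CompilationError as the test run.
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)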
2025-05-07T20:33:33.0077217Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
2025-05-07T20:33:33.0096162Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    [same test body; fails at moe/activation_test.py:117 (fn), entering the kernel through torch/_dynamo/eval_frame.py:678, compiling _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.0109616Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
    [same failure at ref_fn / _kernel_quantize_fp8_row]
2025-05-07T20:33:33.0130831Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
    [same failure at fn / _fbgemm_silu_mul_quant]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0144301Z 2025-05-07T20:33:33.0144750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0144755Z 2025-05-07T20:33:33.0144856Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0145126Z self=, 2025-05-07T20:33:33.0145202Z T=128, 2025-05-07T20:33:33.0145278Z D=5120, 2025-05-07T20:33:33.0145358Z scale_ub=None, 2025-05-07T20:33:33.0145451Z contiguous=False, 2025-05-07T20:33:33.0145534Z compiled=True, 2025-05-07T20:33:33.0145605Z ) 2025-05-07T20:33:33.0145834Z self = 2025-05-07T20:33:33.0146006Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0146014Z 2025-05-07T20:33:33.0146087Z @given( 2025-05-07T20:33:33.0146211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0146306Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0146423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0146541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0146656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0146737Z ) 2025-05-07T20:33:33.0146998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0147090Z def test_silu_mul_quant( 2025-05-07T20:33:33.0147169Z self, 2025-05-07T20:33:33.0147242Z T: int, 2025-05-07T20:33:33.0147315Z D: int, 2025-05-07T20:33:33.0147413Z scale_ub: Optional[float], 2025-05-07T20:33:33.0147501Z contiguous: bool, 2025-05-07T20:33:33.0147585Z compiled: bool, 2025-05-07T20:33:33.0147664Z ) -> None: 2025-05-07T20:33:33.0147759Z torch.manual_seed(2025) 2025-05-07T20:33:33.0147832Z 2025-05-07T20:33:33.0148007Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0148084Z 2025-05-07T20:33:33.0148178Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0148301Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0148388Z x = x_sign * x_clamp 2025-05-07T20:33:33.0148473Z x0 = x[:, :D] 2025-05-07T20:33:33.0148554Z x1 = x[:, D:] 2025-05-07T20:33:33.0148626Z 2025-05-07T20:33:33.0148708Z if contiguous: 2025-05-07T20:33:33.0148795Z x0 = x0.contiguous() 2025-05-07T20:33:33.0148931Z x1 = x1.contiguous() 2025-05-07T20:33:33.0149007Z 2025-05-07T20:33:33.0149096Z if scale_ub is not None: 2025-05-07T20:33:33.0149200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0149338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0149410Z ) 2025-05-07T20:33:33.0149485Z else: 2025-05-07T20:33:33.0149579Z scale_ub_tensor = None 2025-05-07T20:33:33.0149650Z 2025-05-07T20:33:33.0149870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0150004Z op = silu_mul_quant 2025-05-07T20:33:33.0150086Z if compiled: 2025-05-07T20:33:33.0150187Z op = torch.compile(op) 2025-05-07T20:33:33.0150291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0150364Z 2025-05-07T20:33:33.0150456Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0150461Z 2025-05-07T20:33:33.0150556Z moe/activation_test.py:117: 2025-05-07T20:33:33.0150694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0150796Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0150894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0151329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0151423Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0151961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0152061Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0152443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0152715Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0153080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0153174Z kernel = self.compile( 2025-05-07T20:33:33.0153582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0153756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0153893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0153898Z 2025-05-07T20:33:33.0154110Z self = 2025-05-07T20:33:33.0154956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0155514Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe4fc040>} 2025-05-07T20:33:33.0156324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0156527Z context = 2025-05-07T20:33:33.0156531Z 2025-05-07T20:33:33.0156700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0156975Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0157088Z module_map=module_map) 2025-05-07T20:33:33.0157249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0157347Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0157427Z E ^ 2025-05-07T20:33:33.0157848Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.0158299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.0158404Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same test body as above; fails at moe/activation_test.py:117 in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant with the identical CompilationError ("type fp8e4nv not supported in this architecture").
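Every example fails the same way, at Triton compile time rather than in the test logic. Triton's fp8e4nv is the FP8 E4M3 format, and the NVIDIA backend emits it only on compute capability 8.9 or newer (Ada/Hopper); on older parts only 'fp8e4b15' and 'fp8e5' exist, which is exactly what the ValueError reports. A minimal sketch of a capability probe (a hypothetical helper, not part of the test suite):

import torch

def supports_fp8_e4m3() -> bool:
    # Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA SM 8.9+ (e.g. L4, H100).
    # Pre-8.9 GPUs report a lower capability tuple and land in the
    # "supported fp8 dtypes are ('fp8e4b15', 'fp8e5')" error seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)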
2025-05-07T20:33:33.0171382Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): same test body; fails in fn() with the identical CompilationError.
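For context, the op under test fuses y = x0 * sigmoid(x0) * x1 with row-wise FP8 quantization, producing one scale per row that the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. An eager-mode sketch of that reference math, assuming torch.float8_e4m3fn is available and using 448.0 as the E4M3 max (this is illustrative, not the FBGEMM implementation):

import torch

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # y = silu(x0) * x1 computed in fp32, then one FP8 scale per row.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        # scale_ub arrives as a 1-element float32 tensor, as in the test.
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / 448.0  # per-row dequantization scale (E4M3 max = 448)
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # y is approximately y_fp8.float() * scale[:, None]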
2025-05-07T20:33:33.0184706Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): same test body; fails in fn() with the identical CompilationError.
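The error text lists 'fp8e4b15' and 'fp8e5' as the formats this GPU does accept. For illustration only, a tiny Triton kernel casting to tl.float8e5 (E5M2) would compile where the fp8e4nv store in _fbgemm_silu_mul_quant does not; this is a sketch under that assumption, not the FBGEMM kernel:

import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e5(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    # E5M2 ('fp8e5') is accepted on pre-SM 8.9 GPUs, unlike fp8e4nv (E4M3).
    # y_ptr is assumed to point at a torch.float8_e5m2 tensor.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e5), mask=mask)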
2025-05-07T20:33:33.0197904Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same test body; fails in fn() (through torch/_dynamo/eval_frame.py, since compiled=True) with the identical CompilationError.
2025-05-07T20:33:33.0211427Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same test body; fails in fn() with the identical error:
2025-05-07T20:33:33.0224339Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0224344Z 2025-05-07T20:33:33.0224791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0224800Z 2025-05-07T20:33:33.0224902Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0225134Z self=, 2025-05-07T20:33:33.0225211Z T=1, 2025-05-07T20:33:33.0225287Z D=7168, 2025-05-07T20:33:33.0225406Z scale_ub=None, 2025-05-07T20:33:33.0225498Z contiguous=False, 2025-05-07T20:33:33.0225580Z compiled=True, 2025-05-07T20:33:33.0225653Z ) 2025-05-07T20:33:33.0225879Z self = 2025-05-07T20:33:33.0226048Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0226053Z 2025-05-07T20:33:33.0226131Z @given( 2025-05-07T20:33:33.0226257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0226354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0226472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0226585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0226697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0226835Z ) 2025-05-07T20:33:33.0227096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0227187Z def test_silu_mul_quant( 2025-05-07T20:33:33.0227265Z self, 2025-05-07T20:33:33.0227340Z T: int, 2025-05-07T20:33:33.0227415Z D: int, 2025-05-07T20:33:33.0227514Z scale_ub: Optional[float], 2025-05-07T20:33:33.0227605Z contiguous: bool, 2025-05-07T20:33:33.0227736Z compiled: bool, 2025-05-07T20:33:33.0227814Z ) -> None: 2025-05-07T20:33:33.0227907Z torch.manual_seed(2025) 2025-05-07T20:33:33.0227979Z 2025-05-07T20:33:33.0228151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0228222Z 2025-05-07T20:33:33.0228314Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0228438Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0228524Z x = x_sign * x_clamp 2025-05-07T20:33:33.0228608Z x0 = x[:, :D] 2025-05-07T20:33:33.0228687Z x1 = x[:, D:] 2025-05-07T20:33:33.0228761Z 2025-05-07T20:33:33.0228852Z if contiguous: 2025-05-07T20:33:33.0228943Z x0 = x0.contiguous() 2025-05-07T20:33:33.0229033Z x1 = x1.contiguous() 2025-05-07T20:33:33.0229108Z 2025-05-07T20:33:33.0229197Z if scale_ub is not None: 2025-05-07T20:33:33.0229304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0229441Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0229516Z ) 2025-05-07T20:33:33.0229593Z else: 2025-05-07T20:33:33.0229687Z scale_ub_tensor = None 2025-05-07T20:33:33.0229834Z 2025-05-07T20:33:33.0229968Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0230057Z op = silu_mul_quant 2025-05-07T20:33:33.0230139Z if compiled: 2025-05-07T20:33:33.0230241Z op = torch.compile(op) 2025-05-07T20:33:33.0230349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0230417Z 2025-05-07T20:33:33.0230511Z y_fp8, y_scale = fn() 2025-05-07T20:33:33.0230630Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:33.0230711Z 2025-05-07T20:33:33.0230847Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0230947Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:33.0231050Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:33.0231173Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:33.0231312Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:33.0231436Z 2025-05-07T20:33:33.0231535Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:33.0231540Z 2025-05-07T20:33:33.0231636Z moe/activation_test.py:126: 2025-05-07T20:33:33.0231769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0231874Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:33.0232011Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:33.0232619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:33.0232761Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:33.0233147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0233381Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0233777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:33.0234044Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:33.0234472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:33.0234784Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:33.0235192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:33.0235363Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:33.0235733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:33.0235848Z fn() 2025-05-07T20:33:33.0236281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:33.0236367Z self.fn.run( 2025-05-07T20:33:33.0236726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0236824Z kernel = self.compile( 2025-05-07T20:33:33.0237239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0237415Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0237546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0237553Z 2025-05-07T20:33:33.0237764Z self = 2025-05-07T20:33:33.0238619Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0239225Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f58fe034160>}
2025-05-07T20:33:33.0240046Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:33.0240239Z context =
2025-05-07T20:33:33.0240415Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:33.0240695Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:33.0240804Z module_map=module_map)
2025-05-07T20:33:33.0240969Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:33.0241070Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:33.0241188Z E ^
2025-05-07T20:33:33.0241576Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.0242024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.0242133Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same test body; fails in fn() (through torch/_dynamo/eval_frame.py, since compiled=True) with the identical CompilationError.
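Because Hypothesis keeps drawing and shrinking examples after a failure, every drawn example re-triggers the same compile error, which is what inflates this log. Gating the test on device capability would skip the whole class up front; a hypothetical guard (class name and message invented for illustration):

import unittest
import torch

_HAS_FP8_E4M3 = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

@unittest.skipIf(not _HAS_FP8_E4M3, "fp8e4nv (FP8 E4M3) needs SM 8.9+")
class SiluMulQuantTests(unittest.TestCase):  # hypothetical class name
    ...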
2025-05-07T20:33:33.0260949Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): same test body; fails in fn() with the identical CompilationError.
2025-05-07T20:33:33.0274285Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same test body; fails in fn() with the identical error:
2025-05-07T20:33:33.0287642Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0287649Z 2025-05-07T20:33:33.0288097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0288104Z 2025-05-07T20:33:33.0288210Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0288439Z self=, 2025-05-07T20:33:33.0288520Z T=2048, 2025-05-07T20:33:33.0288594Z D=7168, 2025-05-07T20:33:33.0288738Z scale_ub=1200.0, 2025-05-07T20:33:33.0288829Z contiguous=False, 2025-05-07T20:33:33.0288914Z compiled=True, 2025-05-07T20:33:33.0288991Z ) 2025-05-07T20:33:33.0289218Z self = 2025-05-07T20:33:33.0289401Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0289405Z 2025-05-07T20:33:33.0289482Z @given( 2025-05-07T20:33:33.0289605Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0289766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0289884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0290000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0290114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0290192Z ) 2025-05-07T20:33:33.0290451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0290546Z def test_silu_mul_quant( 2025-05-07T20:33:33.0290627Z self, 2025-05-07T20:33:33.0290706Z T: int, 2025-05-07T20:33:33.0290783Z D: int, 2025-05-07T20:33:33.0290885Z scale_ub: Optional[float], 2025-05-07T20:33:33.0290977Z contiguous: bool, 2025-05-07T20:33:33.0291067Z compiled: bool, 2025-05-07T20:33:33.0291208Z ) -> None: 2025-05-07T20:33:33.0291304Z torch.manual_seed(2025) 2025-05-07T20:33:33.0291382Z 2025-05-07T20:33:33.0291553Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0291629Z 2025-05-07T20:33:33.0291723Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0291847Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0291935Z x = x_sign * x_clamp 2025-05-07T20:33:33.0292065Z x0 = x[:, :D] 2025-05-07T20:33:33.0292144Z x1 = x[:, D:] 2025-05-07T20:33:33.0292215Z 2025-05-07T20:33:33.0292300Z if contiguous: 2025-05-07T20:33:33.0292395Z x0 = x0.contiguous() 2025-05-07T20:33:33.0292487Z x1 = x1.contiguous() 2025-05-07T20:33:33.0292566Z 2025-05-07T20:33:33.0292658Z if scale_ub is not None: 2025-05-07T20:33:33.0292766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0292904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0292980Z ) 2025-05-07T20:33:33.0293058Z else: 2025-05-07T20:33:33.0293151Z scale_ub_tensor = None 2025-05-07T20:33:33.0293226Z 2025-05-07T20:33:33.0293363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0293456Z op = silu_mul_quant 2025-05-07T20:33:33.0293541Z if compiled: 2025-05-07T20:33:33.0293642Z op = torch.compile(op) 2025-05-07T20:33:33.0293752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0293829Z 2025-05-07T20:33:33.0293920Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0293925Z 2025-05-07T20:33:33.0294023Z moe/activation_test.py:117: 2025-05-07T20:33:33.0294157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0294258Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0294358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0294757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0294852Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0295391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0295493Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0295876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0296115Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0296548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0296644Z kernel = self.compile( 2025-05-07T20:33:33.0297057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0297236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0297373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0297377Z 2025-05-07T20:33:33.0297635Z self = 2025-05-07T20:33:33.0298481Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0299033Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe22aee0>} 2025-05-07T20:33:33.0299848Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0300086Z context = 2025-05-07T20:33:33.0300091Z 2025-05-07T20:33:33.0300261Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0300542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0300649Z module_map=module_map) 2025-05-07T20:33:33.0300813Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0300956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0301035Z E ^ 2025-05-07T20:33:33.0301418Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0301423Z 2025-05-07T20:33:33.0301870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0301875Z 2025-05-07T20:33:33.0301975Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0302210Z self=, 2025-05-07T20:33:33.0302288Z T=1, 2025-05-07T20:33:33.0302364Z D=5120, 2025-05-07T20:33:33.0302455Z scale_ub=None, 2025-05-07T20:33:33.0302541Z contiguous=False, 2025-05-07T20:33:33.0302624Z compiled=False, 2025-05-07T20:33:33.0302698Z ) 2025-05-07T20:33:33.0302921Z self = 2025-05-07T20:33:33.0303094Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:33.0303098Z 2025-05-07T20:33:33.0303177Z @given( 2025-05-07T20:33:33.0303299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0303397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0303516Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0303633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0303750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0303824Z ) 2025-05-07T20:33:33.0304080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0304177Z def test_silu_mul_quant( 2025-05-07T20:33:33.0304258Z self, 2025-05-07T20:33:33.0304337Z T: int, 2025-05-07T20:33:33.0304417Z D: int, 2025-05-07T20:33:33.0304516Z scale_ub: Optional[float], 2025-05-07T20:33:33.0304609Z contiguous: bool, 2025-05-07T20:33:33.0304698Z compiled: bool, 2025-05-07T20:33:33.0304775Z ) -> None: 2025-05-07T20:33:33.0304873Z torch.manual_seed(2025) 2025-05-07T20:33:33.0304951Z 2025-05-07T20:33:33.0305168Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0305249Z 2025-05-07T20:33:33.0305342Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0305468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0305563Z x = x_sign * x_clamp 2025-05-07T20:33:33.0305649Z x0 = x[:, :D] 2025-05-07T20:33:33.0305731Z x1 = x[:, D:] 2025-05-07T20:33:33.0305810Z 2025-05-07T20:33:33.0305894Z if contiguous: 2025-05-07T20:33:33.0305988Z x0 = x0.contiguous() 2025-05-07T20:33:33.0306126Z x1 = x1.contiguous() 2025-05-07T20:33:33.0306199Z 2025-05-07T20:33:33.0306291Z if scale_ub is not None: 2025-05-07T20:33:33.0306399Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0306537Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0306615Z ) 2025-05-07T20:33:33.0306688Z else: 2025-05-07T20:33:33.0306784Z scale_ub_tensor = None 2025-05-07T20:33:33.0306861Z 2025-05-07T20:33:33.0306990Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0307081Z op = silu_mul_quant 2025-05-07T20:33:33.0307169Z if compiled: 2025-05-07T20:33:33.0307268Z op = torch.compile(op) 2025-05-07T20:33:33.0307414Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0307496Z 2025-05-07T20:33:33.0307588Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0307592Z 2025-05-07T20:33:33.0307692Z moe/activation_test.py:117: 2025-05-07T20:33:33.0307825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0307926Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0308029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0308615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0308719Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0309112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0309346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0309715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0309888Z kernel = self.compile( 2025-05-07T20:33:33.0310299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0310486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0310615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0310622Z 2025-05-07T20:33:33.0310834Z self = 2025-05-07T20:33:33.0311685Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0312233Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe1705e0>} 2025-05-07T20:33:33.0313053Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0313252Z context = 2025-05-07T20:33:33.0313256Z 2025-05-07T20:33:33.0313431Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0313708Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0313860Z module_map=module_map) 2025-05-07T20:33:33.0314027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0314124Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0314199Z E ^ 2025-05-07T20:33:33.0314585Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0314589Z 2025-05-07T20:33:33.0315034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0315079Z 2025-05-07T20:33:33.0315185Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0315413Z self=, 2025-05-07T20:33:33.0315493Z T=4096, 2025-05-07T20:33:33.0315570Z D=7168, 2025-05-07T20:33:33.0315653Z scale_ub=1200.0, 2025-05-07T20:33:33.0315742Z contiguous=False, 2025-05-07T20:33:33.0315828Z compiled=False, 2025-05-07T20:33:33.0315909Z ) 2025-05-07T20:33:33.0316137Z self = 2025-05-07T20:33:33.0316321Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0316326Z 2025-05-07T20:33:33.0316400Z @given( 2025-05-07T20:33:33.0316562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0316663Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0316778Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0316907Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0317022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0317099Z ) 2025-05-07T20:33:33.0317360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0317491Z def test_silu_mul_quant( 2025-05-07T20:33:33.0317570Z self, 2025-05-07T20:33:33.0317645Z T: int, 2025-05-07T20:33:33.0317726Z D: int, 2025-05-07T20:33:33.0317823Z scale_ub: Optional[float], 2025-05-07T20:33:33.0317909Z contiguous: bool, 2025-05-07T20:33:33.0317994Z compiled: bool, 2025-05-07T20:33:33.0318074Z ) -> None: 2025-05-07T20:33:33.0318164Z torch.manual_seed(2025) 2025-05-07T20:33:33.0318233Z 2025-05-07T20:33:33.0318410Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0318485Z 2025-05-07T20:33:33.0318577Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0318707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0318794Z x = x_sign * x_clamp 2025-05-07T20:33:33.0318882Z x0 = x[:, :D] 2025-05-07T20:33:33.0318958Z x1 = x[:, D:] 2025-05-07T20:33:33.0319033Z 2025-05-07T20:33:33.0319120Z if contiguous: 2025-05-07T20:33:33.0319210Z x0 = x0.contiguous() 2025-05-07T20:33:33.0319295Z x1 = x1.contiguous() 2025-05-07T20:33:33.0319375Z 2025-05-07T20:33:33.0319466Z if scale_ub is not None: 2025-05-07T20:33:33.0319570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0319703Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0319786Z ) 2025-05-07T20:33:33.0319861Z else: 2025-05-07T20:33:33.0319957Z scale_ub_tensor = None 2025-05-07T20:33:33.0320034Z 2025-05-07T20:33:33.0320162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0320252Z op = silu_mul_quant 2025-05-07T20:33:33.0320337Z if compiled: 2025-05-07T20:33:33.0320435Z op = torch.compile(op) 2025-05-07T20:33:33.0320541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0320613Z 2025-05-07T20:33:33.0320705Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0320709Z 2025-05-07T20:33:33.0320810Z moe/activation_test.py:117: 2025-05-07T20:33:33.0320984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0321083Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0321183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0321722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0321826Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0322208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0322439Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0322854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0322948Z kernel = self.compile( 2025-05-07T20:33:33.0323356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0323536Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0323664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0323668Z 2025-05-07T20:33:33.0323877Z self = 2025-05-07T20:33:33.0324764Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0325315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdab01f0>} 2025-05-07T20:33:33.0326125Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0326387Z context = 2025-05-07T20:33:33.0326391Z 2025-05-07T20:33:33.0326560Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0326833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0326942Z module_map=module_map) 2025-05-07T20:33:33.0327106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0327202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0327285Z E ^ 2025-05-07T20:33:33.0327662Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0327667Z 2025-05-07T20:33:33.0328110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0328118Z 2025-05-07T20:33:33.0328216Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0328447Z self=, 2025-05-07T20:33:33.0328523Z T=16384, 2025-05-07T20:33:33.0328598Z D=7168, 2025-05-07T20:33:33.0328679Z scale_ub=None, 2025-05-07T20:33:33.0328764Z contiguous=True, 2025-05-07T20:33:33.0328845Z compiled=True, 2025-05-07T20:33:33.0328938Z ) 2025-05-07T20:33:33.0329196Z self = 2025-05-07T20:33:33.0329372Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:33.0329379Z 2025-05-07T20:33:33.0329456Z @given( 2025-05-07T20:33:33.0329575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0329673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0329791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0329905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0330060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0330140Z ) 2025-05-07T20:33:33.0330397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0330485Z def test_silu_mul_quant( 2025-05-07T20:33:33.0330561Z self, 2025-05-07T20:33:33.0330634Z T: int, 2025-05-07T20:33:33.0330709Z D: int, 2025-05-07T20:33:33.0330809Z scale_ub: Optional[float], 2025-05-07T20:33:33.0330895Z contiguous: bool, 2025-05-07T20:33:33.0330984Z compiled: bool, 2025-05-07T20:33:33.0331102Z ) -> None: 2025-05-07T20:33:33.0331194Z torch.manual_seed(2025) 2025-05-07T20:33:33.0331271Z 2025-05-07T20:33:33.0331442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0331518Z 2025-05-07T20:33:33.0331609Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0331730Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0331817Z x = x_sign * x_clamp 2025-05-07T20:33:33.0331901Z x0 = x[:, :D] 2025-05-07T20:33:33.0331980Z x1 = x[:, D:] 2025-05-07T20:33:33.0332052Z 2025-05-07T20:33:33.0332136Z if contiguous: 2025-05-07T20:33:33.0332223Z x0 = x0.contiguous() 2025-05-07T20:33:33.0332316Z x1 = x1.contiguous() 2025-05-07T20:33:33.0332387Z 2025-05-07T20:33:33.0332519Z if scale_ub is not None: 2025-05-07T20:33:33.0332626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0332761Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0332838Z ) 2025-05-07T20:33:33.0332915Z else: 2025-05-07T20:33:33.0333008Z scale_ub_tensor = None 2025-05-07T20:33:33.0333080Z 2025-05-07T20:33:33.0333209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0333342Z op = silu_mul_quant 2025-05-07T20:33:33.0333423Z if compiled: 2025-05-07T20:33:33.0333524Z op = torch.compile(op) 2025-05-07T20:33:33.0333630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0333704Z 2025-05-07T20:33:33.0333793Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0333797Z 2025-05-07T20:33:33.0333891Z moe/activation_test.py:117: 2025-05-07T20:33:33.0334022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0334124Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0334223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0334622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0334714Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0335249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0335354Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0335736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0335969Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0336329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0336421Z kernel = self.compile( 2025-05-07T20:33:33.0336834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0337010Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0337142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0337147Z 2025-05-07T20:33:33.0337355Z self = 2025-05-07T20:33:33.0338246Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0338795Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdab0ee0>} 2025-05-07T20:33:33.0339660Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0339857Z context = 2025-05-07T20:33:33.0339899Z 2025-05-07T20:33:33.0340066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0340339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0340448Z module_map=module_map) 2025-05-07T20:33:33.0340612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0340712Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0340787Z E ^ 2025-05-07T20:33:33.0341167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0341171Z 2025-05-07T20:33:33.0341657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0341662Z 2025-05-07T20:33:33.0341761Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0341998Z self=, 2025-05-07T20:33:33.0342073Z T=4096, 2025-05-07T20:33:33.0342148Z D=5120, 2025-05-07T20:33:33.0342229Z scale_ub=None, 2025-05-07T20:33:33.0342357Z contiguous=False, 2025-05-07T20:33:33.0342441Z compiled=True, 2025-05-07T20:33:33.0342516Z ) 2025-05-07T20:33:33.0342740Z self = 2025-05-07T20:33:33.0342917Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0342921Z 2025-05-07T20:33:33.0342996Z @given( 2025-05-07T20:33:33.0343114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0343218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0343334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0343448Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0343564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0343637Z ) 2025-05-07T20:33:33.0343893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0343985Z def test_silu_mul_quant( 2025-05-07T20:33:33.0344062Z self, 2025-05-07T20:33:33.0344139Z T: int, 2025-05-07T20:33:33.0344216Z D: int, 2025-05-07T20:33:33.0344313Z scale_ub: Optional[float], 2025-05-07T20:33:33.0344401Z contiguous: bool, 2025-05-07T20:33:33.0344496Z compiled: bool, 2025-05-07T20:33:33.0344573Z ) -> None: 2025-05-07T20:33:33.0344669Z torch.manual_seed(2025) 2025-05-07T20:33:33.0344738Z 2025-05-07T20:33:33.0344905Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0344979Z 2025-05-07T20:33:33.0345072Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0345197Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0345285Z x = x_sign * x_clamp 2025-05-07T20:33:33.0345367Z x0 = x[:, :D] 2025-05-07T20:33:33.0345446Z x1 = x[:, D:] 2025-05-07T20:33:33.0345523Z 2025-05-07T20:33:33.0345605Z if contiguous: 2025-05-07T20:33:33.0345694Z x0 = x0.contiguous() 2025-05-07T20:33:33.0345787Z x1 = x1.contiguous() 2025-05-07T20:33:33.0345856Z 2025-05-07T20:33:33.0345945Z if scale_ub is not None: 2025-05-07T20:33:33.0346055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0346240Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0346318Z ) 2025-05-07T20:33:33.0346392Z else: 2025-05-07T20:33:33.0346484Z scale_ub_tensor = None 2025-05-07T20:33:33.0346560Z 2025-05-07T20:33:33.0346686Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0346774Z op = silu_mul_quant 2025-05-07T20:33:33.0346858Z if compiled: 2025-05-07T20:33:33.0346957Z op = torch.compile(op) 2025-05-07T20:33:33.0347101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0347173Z 2025-05-07T20:33:33.0347263Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0347267Z 2025-05-07T20:33:33.0347364Z moe/activation_test.py:117: 2025-05-07T20:33:33.0347498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0347596Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0347697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0348091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0348184Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0348765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0348863Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0349244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0349477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0349904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0350044Z kernel = self.compile( 2025-05-07T20:33:33.0350450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0350629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0350760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0350765Z 2025-05-07T20:33:33.0350971Z self = 2025-05-07T20:33:33.0351823Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0352368Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fe13a940>} 2025-05-07T20:33:33.0353188Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0353381Z context = 2025-05-07T20:33:33.0353386Z 2025-05-07T20:33:33.0353552Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0353833Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0353937Z module_map=module_map) 2025-05-07T20:33:33.0354100Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0354202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0354277Z E ^ 2025-05-07T20:33:33.0354659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0354667Z 2025-05-07T20:33:33.0355111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0355115Z 2025-05-07T20:33:33.0355260Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0355497Z self=, 2025-05-07T20:33:33.0355572Z T=4096, 2025-05-07T20:33:33.0355650Z D=5120, 2025-05-07T20:33:33.0355729Z scale_ub=1200.0, 2025-05-07T20:33:33.0355810Z contiguous=False, 2025-05-07T20:33:33.0355900Z compiled=False, 2025-05-07T20:33:33.0355974Z ) 2025-05-07T20:33:33.0356196Z self = 2025-05-07T20:33:33.0356419Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0356424Z 2025-05-07T20:33:33.0356496Z @given( 2025-05-07T20:33:33.0356614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0356716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0356831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0356953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0357066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0357140Z ) 2025-05-07T20:33:33.0357399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0357490Z def test_silu_mul_quant( 2025-05-07T20:33:33.0357565Z self, 2025-05-07T20:33:33.0357705Z T: int, 2025-05-07T20:33:33.0357780Z D: int, 2025-05-07T20:33:33.0357877Z scale_ub: Optional[float], 2025-05-07T20:33:33.0357964Z contiguous: bool, 2025-05-07T20:33:33.0358051Z compiled: bool, 2025-05-07T20:33:33.0358128Z ) -> None: 2025-05-07T20:33:33.0358222Z torch.manual_seed(2025) 2025-05-07T20:33:33.0358294Z 2025-05-07T20:33:33.0358466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0358579Z 2025-05-07T20:33:33.0358669Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0358796Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0358883Z x = x_sign * x_clamp 2025-05-07T20:33:33.0358962Z x0 = x[:, :D] 2025-05-07T20:33:33.0359044Z x1 = x[:, D:] 2025-05-07T20:33:33.0359113Z 2025-05-07T20:33:33.0359196Z if contiguous: 2025-05-07T20:33:33.0359288Z x0 = x0.contiguous() 2025-05-07T20:33:33.0359375Z x1 = x1.contiguous() 2025-05-07T20:33:33.0359446Z 2025-05-07T20:33:33.0359539Z if scale_ub is not None: 2025-05-07T20:33:33.0359641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0359779Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0359852Z ) 2025-05-07T20:33:33.0359925Z else: 2025-05-07T20:33:33.0360021Z scale_ub_tensor = None 2025-05-07T20:33:33.0360095Z 2025-05-07T20:33:33.0360222Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0360311Z op = silu_mul_quant 2025-05-07T20:33:33.0360395Z if compiled: 2025-05-07T20:33:33.0360495Z op = torch.compile(op) 2025-05-07T20:33:33.0360605Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0360677Z 2025-05-07T20:33:33.0360765Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0360769Z 2025-05-07T20:33:33.0360867Z moe/activation_test.py:117: 2025-05-07T20:33:33.0360999Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0361100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0361199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0361739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0361838Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0362224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0362456Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0362865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0362958Z kernel = self.compile( 2025-05-07T20:33:33.0363367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0363543Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0363670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0363714Z 2025-05-07T20:33:33.0363924Z self = 2025-05-07T20:33:33.0364771Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0365325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdbd93a0>} 2025-05-07T20:33:33.0366175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0366372Z context = 2025-05-07T20:33:33.0366379Z 2025-05-07T20:33:33.0366547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0366823Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0366928Z module_map=module_map) 2025-05-07T20:33:33.0367128Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0367225Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0367302Z E ^ 2025-05-07T20:33:33.0367686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0367691Z 2025-05-07T20:33:33.0368137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0368142Z 2025-05-07T20:33:33.0368244Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0368474Z self=, 2025-05-07T20:33:33.0368551Z T=4096, 2025-05-07T20:33:33.0368624Z D=5120, 2025-05-07T20:33:33.0368711Z scale_ub=1200.0, 2025-05-07T20:33:33.0368815Z contiguous=False, 2025-05-07T20:33:33.0368905Z compiled=True, 2025-05-07T20:33:33.0368991Z ) 2025-05-07T20:33:33.0369231Z self = 2025-05-07T20:33:33.0369409Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0369414Z 2025-05-07T20:33:33.0369493Z @given( 2025-05-07T20:33:33.0369610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0369705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0369823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0369937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0370051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0370125Z ) 2025-05-07T20:33:33.0370381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0370475Z def test_silu_mul_quant( 2025-05-07T20:33:33.0370549Z self, 2025-05-07T20:33:33.0370625Z T: int, 2025-05-07T20:33:33.0370701Z D: int, 2025-05-07T20:33:33.0370796Z scale_ub: Optional[float], 2025-05-07T20:33:33.0370886Z contiguous: bool, 2025-05-07T20:33:33.0370973Z compiled: bool, 2025-05-07T20:33:33.0371050Z ) -> None: 2025-05-07T20:33:33.0371187Z torch.manual_seed(2025) 2025-05-07T20:33:33.0371263Z 2025-05-07T20:33:33.0371431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0371503Z 2025-05-07T20:33:33.0371596Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0371720Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0371810Z x = x_sign * x_clamp 2025-05-07T20:33:33.0371893Z x0 = x[:, :D] 2025-05-07T20:33:33.0371973Z x1 = x[:, D:] 2025-05-07T20:33:33.0372046Z 2025-05-07T20:33:33.0372171Z if contiguous: 2025-05-07T20:33:33.0372262Z x0 = x0.contiguous() 2025-05-07T20:33:33.0372350Z x1 = x1.contiguous() 2025-05-07T20:33:33.0372424Z 2025-05-07T20:33:33.0372512Z if scale_ub is not None: 2025-05-07T20:33:33.0372620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0372753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0372830Z ) 2025-05-07T20:33:33.0372912Z else: 2025-05-07T20:33:33.0373004Z scale_ub_tensor = None 2025-05-07T20:33:33.0373077Z 2025-05-07T20:33:33.0373207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0373296Z op = silu_mul_quant 2025-05-07T20:33:33.0373382Z if compiled: 2025-05-07T20:33:33.0373522Z op = torch.compile(op) 2025-05-07T20:33:33.0373628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0373702Z 2025-05-07T20:33:33.0373793Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0373800Z 2025-05-07T20:33:33.0373896Z moe/activation_test.py:117: 2025-05-07T20:33:33.0374030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0374130Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0374269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0374661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0374754Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0375292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0375387Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0375769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0376005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0376366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0376458Z kernel = self.compile( 2025-05-07T20:33:33.0376867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0377046Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0377178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0377183Z 2025-05-07T20:33:33.0377393Z self = 2025-05-07T20:33:33.0378242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0378792Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdbd9280>} 2025-05-07T20:33:33.0379653Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0379852Z context = 2025-05-07T20:33:33.0379898Z 2025-05-07T20:33:33.0380064Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0380340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0380445Z module_map=module_map) 2025-05-07T20:33:33.0380607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0380706Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0380780Z E ^ 2025-05-07T20:33:33.0381156Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0381199Z 2025-05-07T20:33:33.0381645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0381652Z 2025-05-07T20:33:33.0381751Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0381986Z self=, 2025-05-07T20:33:33.0382061Z T=2048, 2025-05-07T20:33:33.0382137Z D=7168, 2025-05-07T20:33:33.0382225Z scale_ub=1200.0, 2025-05-07T20:33:33.0382309Z contiguous=False, 2025-05-07T20:33:33.0382394Z compiled=False, 2025-05-07T20:33:33.0382473Z ) 2025-05-07T20:33:33.0382909Z self = 2025-05-07T20:33:33.0383147Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0383156Z 2025-05-07T20:33:33.0383238Z @given( 2025-05-07T20:33:33.0383357Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0383455Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0387104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0387353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0387470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0387550Z ) 2025-05-07T20:33:33.0387816Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0387915Z def test_silu_mul_quant( 2025-05-07T20:33:33.0387993Z self, 2025-05-07T20:33:33.0388069Z T: int, 2025-05-07T20:33:33.0388146Z D: int, 2025-05-07T20:33:33.0388246Z scale_ub: Optional[float], 2025-05-07T20:33:33.0388341Z contiguous: bool, 2025-05-07T20:33:33.0388436Z compiled: bool, 2025-05-07T20:33:33.0388516Z ) -> None: 2025-05-07T20:33:33.0388614Z torch.manual_seed(2025) 2025-05-07T20:33:33.0388691Z 2025-05-07T20:33:33.0388870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0388959Z 2025-05-07T20:33:33.0389067Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0389218Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0389308Z x = x_sign * x_clamp 2025-05-07T20:33:33.0389386Z x0 = x[:, :D] 2025-05-07T20:33:33.0389467Z x1 = x[:, D:] 2025-05-07T20:33:33.0389542Z 2025-05-07T20:33:33.0389624Z if contiguous: 2025-05-07T20:33:33.0389715Z x0 = x0.contiguous() 2025-05-07T20:33:33.0389879Z x1 = x1.contiguous() 2025-05-07T20:33:33.0389953Z 2025-05-07T20:33:33.0390044Z if scale_ub is not None: 2025-05-07T20:33:33.0390156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0390293Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0390374Z ) 2025-05-07T20:33:33.0390453Z else: 2025-05-07T20:33:33.0390548Z scale_ub_tensor = None 2025-05-07T20:33:33.0390623Z 2025-05-07T20:33:33.0390751Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0390842Z op = silu_mul_quant 2025-05-07T20:33:33.0390938Z if compiled: 2025-05-07T20:33:33.0391039Z op = torch.compile(op) 2025-05-07T20:33:33.0391145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0391292Z 2025-05-07T20:33:33.0391385Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0391390Z 2025-05-07T20:33:33.0391493Z moe/activation_test.py:117: 2025-05-07T20:33:33.0391628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0391727Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0391833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0392384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0392567Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0392958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0393196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0393564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0393659Z kernel = self.compile( 2025-05-07T20:33:33.0394072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0394256Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0394447Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0394452Z 2025-05-07T20:33:33.0394665Z self = 2025-05-07T20:33:33.0395530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0396124Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fde39670>} 2025-05-07T20:33:33.0396953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0397154Z context = 2025-05-07T20:33:33.0397161Z 2025-05-07T20:33:33.0397341Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0397621Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0397731Z module_map=module_map) 2025-05-07T20:33:33.0397900Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0397998Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0398078Z E ^ 2025-05-07T20:33:33.0398468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0398475Z 2025-05-07T20:33:33.0398924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0398928Z 2025-05-07T20:33:33.0399038Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0399271Z self=, 2025-05-07T20:33:33.0399348Z T=1, 2025-05-07T20:33:33.0399430Z D=7168, 2025-05-07T20:33:33.0399515Z scale_ub=None, 2025-05-07T20:33:33.0399601Z contiguous=True, 2025-05-07T20:33:33.0399691Z compiled=False, 2025-05-07T20:33:33.0399768Z ) 2025-05-07T20:33:33.0399994Z self = 2025-05-07T20:33:33.0400163Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:33.0400170Z 2025-05-07T20:33:33.0400257Z @given( 2025-05-07T20:33:33.0400376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0400519Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0400638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0400756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0400874Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0400953Z ) 2025-05-07T20:33:33.0401216Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0401313Z def test_silu_mul_quant( 2025-05-07T20:33:33.0401390Z self, 2025-05-07T20:33:33.0401507Z T: int, 2025-05-07T20:33:33.0401591Z D: int, 2025-05-07T20:33:33.0401689Z scale_ub: Optional[float], 2025-05-07T20:33:33.0401778Z contiguous: bool, 2025-05-07T20:33:33.0401872Z compiled: bool, 2025-05-07T20:33:33.0401954Z ) -> None: 2025-05-07T20:33:33.0402049Z torch.manual_seed(2025) 2025-05-07T20:33:33.0402126Z 2025-05-07T20:33:33.0402300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0402378Z 2025-05-07T20:33:33.0402476Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0402604Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0402694Z x = x_sign * x_clamp 2025-05-07T20:33:33.0402776Z x0 = x[:, :D] 2025-05-07T20:33:33.0402857Z x1 = x[:, D:] 2025-05-07T20:33:33.0402973Z 2025-05-07T20:33:33.0403059Z if contiguous: 2025-05-07T20:33:33.0403151Z x0 = x0.contiguous() 2025-05-07T20:33:33.0403246Z x1 = x1.contiguous() 2025-05-07T20:33:33.0403325Z 2025-05-07T20:33:33.0403417Z if scale_ub is not None: 2025-05-07T20:33:33.0403530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0403668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0403785Z ) 2025-05-07T20:33:33.0403864Z else: 2025-05-07T20:33:33.0403959Z scale_ub_tensor = None 2025-05-07T20:33:33.0404037Z 2025-05-07T20:33:33.0404170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0404258Z op = silu_mul_quant 2025-05-07T20:33:33.0404347Z if compiled: 2025-05-07T20:33:33.0404447Z op = torch.compile(op) 2025-05-07T20:33:33.0404554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0404630Z 2025-05-07T20:33:33.0404724Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0404728Z 2025-05-07T20:33:33.0404825Z moe/activation_test.py:117: 2025-05-07T20:33:33.0404961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0405064Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0405165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0405711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0405810Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0406199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0406433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0406797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0406894Z kernel = self.compile( 2025-05-07T20:33:33.0407306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0407489Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0407620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0407624Z 2025-05-07T20:33:33.0407837Z self = 2025-05-07T20:33:33.0408740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0409290Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd9d6280>} 2025-05-07T20:33:33.0410109Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0410343Z context = 2025-05-07T20:33:33.0410348Z 2025-05-07T20:33:33.0410519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0410803Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0410909Z module_map=module_map) 2025-05-07T20:33:33.0411077Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0411177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0411252Z E ^ 2025-05-07T20:33:33.0411641Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0411645Z 2025-05-07T20:33:33.0412132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0412137Z 2025-05-07T20:33:33.0412251Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0412485Z self=, 2025-05-07T20:33:33.0412564Z T=16384, 2025-05-07T20:33:33.0412649Z D=7168, 2025-05-07T20:33:33.0412777Z scale_ub=1200.0, 2025-05-07T20:33:33.0412869Z contiguous=False, 2025-05-07T20:33:33.0412963Z compiled=True, 2025-05-07T20:33:33.0413041Z ) 2025-05-07T20:33:33.0413273Z self = 2025-05-07T20:33:33.0413461Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0413466Z 2025-05-07T20:33:33.0413543Z @given( 2025-05-07T20:33:33.0413666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0413768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0413884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0414008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0414124Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0414198Z ) 2025-05-07T20:33:33.0414461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0414557Z def test_silu_mul_quant( 2025-05-07T20:33:33.0414633Z self, 2025-05-07T20:33:33.0414711Z T: int, 2025-05-07T20:33:33.0414787Z D: int, 2025-05-07T20:33:33.0414889Z scale_ub: Optional[float], 2025-05-07T20:33:33.0414979Z contiguous: bool, 2025-05-07T20:33:33.0415066Z compiled: bool, 2025-05-07T20:33:33.0415146Z ) -> None: 2025-05-07T20:33:33.0415240Z torch.manual_seed(2025) 2025-05-07T20:33:33.0415313Z 2025-05-07T20:33:33.0415490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0415569Z 2025-05-07T20:33:33.0415660Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0415791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0415882Z x = x_sign * x_clamp 2025-05-07T20:33:33.0415962Z x0 = x[:, :D] 2025-05-07T20:33:33.0416045Z x1 = x[:, D:] 2025-05-07T20:33:33.0416119Z 2025-05-07T20:33:33.0416202Z if contiguous: 2025-05-07T20:33:33.0416300Z x0 = x0.contiguous() 2025-05-07T20:33:33.0416388Z x1 = x1.contiguous() 2025-05-07T20:33:33.0416466Z 2025-05-07T20:33:33.0416559Z if scale_ub is not None: 2025-05-07T20:33:33.0416710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0416851Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0416925Z ) 2025-05-07T20:33:33.0417002Z else: 2025-05-07T20:33:33.0417101Z scale_ub_tensor = None 2025-05-07T20:33:33.0417177Z 2025-05-07T20:33:33.0417307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0417401Z op = silu_mul_quant 2025-05-07T20:33:33.0417486Z if compiled: 2025-05-07T20:33:33.0417588Z op = torch.compile(op) 2025-05-07T20:33:33.0417740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0417812Z 2025-05-07T20:33:33.0417907Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0417911Z 2025-05-07T20:33:33.0418011Z moe/activation_test.py:117: 2025-05-07T20:33:33.0418145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0418252Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0418354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0418752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0418861Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0419479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0419581Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0419964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0420200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0420566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0420701Z kernel = self.compile( 2025-05-07T20:33:33.0421114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0421297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0421429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0421434Z 2025-05-07T20:33:33.0421654Z self = 2025-05-07T20:33:33.0422504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0423055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd9d6ee0>} 2025-05-07T20:33:33.0423876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0424073Z context = 2025-05-07T20:33:33.0424077Z 2025-05-07T20:33:33.0424248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0424527Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0424636Z module_map=module_map) 2025-05-07T20:33:33.0424803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0424900Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0424979Z E ^ 2025-05-07T20:33:33.0425361Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0425368Z 2025-05-07T20:33:33.0425882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0425891Z 2025-05-07T20:33:33.0425995Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0426229Z self=, 2025-05-07T20:33:33.0426310Z T=1, 2025-05-07T20:33:33.0426387Z D=7168, 2025-05-07T20:33:33.0426470Z scale_ub=None, 2025-05-07T20:33:33.0426561Z contiguous=False, 2025-05-07T20:33:33.0426646Z compiled=False, 2025-05-07T20:33:33.0426720Z ) 2025-05-07T20:33:33.0426951Z self = 2025-05-07T20:33:33.0427164Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:33.0427169Z 2025-05-07T20:33:33.0427245Z @given( 2025-05-07T20:33:33.0427368Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0427469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0427588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0427711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0427825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0427902Z ) 2025-05-07T20:33:33.0428160Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0428252Z def test_silu_mul_quant( 2025-05-07T20:33:33.0428369Z self, 2025-05-07T20:33:33.0428449Z T: int, 2025-05-07T20:33:33.0428526Z D: int, 2025-05-07T20:33:33.0428625Z scale_ub: Optional[float], 2025-05-07T20:33:33.0428720Z contiguous: bool, 2025-05-07T20:33:33.0428815Z compiled: bool, 2025-05-07T20:33:33.0428913Z ) -> None: 2025-05-07T20:33:33.0429018Z torch.manual_seed(2025) 2025-05-07T20:33:33.0429108Z 2025-05-07T20:33:33.0429323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0429398Z 2025-05-07T20:33:33.0429492Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0429620Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0429708Z x = x_sign * x_clamp 2025-05-07T20:33:33.0429878Z x0 = x[:, :D] 2025-05-07T20:33:33.0429957Z x1 = x[:, D:] 2025-05-07T20:33:33.0430030Z 2025-05-07T20:33:33.0430115Z if contiguous: 2025-05-07T20:33:33.0430206Z x0 = x0.contiguous() 2025-05-07T20:33:33.0430294Z x1 = x1.contiguous() 2025-05-07T20:33:33.0430369Z 2025-05-07T20:33:33.0430459Z if scale_ub is not None: 2025-05-07T20:33:33.0430564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0430699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0430775Z ) 2025-05-07T20:33:33.0430853Z else: 2025-05-07T20:33:33.0430945Z scale_ub_tensor = None 2025-05-07T20:33:33.0431021Z 2025-05-07T20:33:33.0431152Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0431241Z op = silu_mul_quant 2025-05-07T20:33:33.0431325Z if compiled: 2025-05-07T20:33:33.0431427Z op = torch.compile(op) 2025-05-07T20:33:33.0431532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0431603Z 2025-05-07T20:33:33.0431696Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0431700Z 2025-05-07T20:33:33.0431795Z moe/activation_test.py:117: 2025-05-07T20:33:33.0431930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0432027Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0432130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0432671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0432765Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0433151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0433432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0433792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0433886Z kernel = self.compile( 2025-05-07T20:33:33.0434297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0434474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0434604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0434649Z 2025-05-07T20:33:33.0434857Z self = 2025-05-07T20:33:33.0435706Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0436254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdcfd670>} 2025-05-07T20:33:33.0437105Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0437302Z context = 2025-05-07T20:33:33.0437308Z 2025-05-07T20:33:33.0437474Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0437753Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0437901Z module_map=module_map) 2025-05-07T20:33:33.0438060Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0438159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0438236Z E ^ 2025-05-07T20:33:33.0438618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0438622Z 2025-05-07T20:33:33.0439066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0439073Z 2025-05-07T20:33:33.0439172Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0439401Z self=, 2025-05-07T20:33:33.0439482Z T=2048, 2025-05-07T20:33:33.0439556Z D=7168, 2025-05-07T20:33:33.0439635Z scale_ub=None, 2025-05-07T20:33:33.0439719Z contiguous=False, 2025-05-07T20:33:33.0439802Z compiled=True, 2025-05-07T20:33:33.0439875Z ) 2025-05-07T20:33:33.0440098Z self = 2025-05-07T20:33:33.0440277Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0440284Z 2025-05-07T20:33:33.0440359Z @given( 2025-05-07T20:33:33.0440475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0440577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0440690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0440808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0440920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0440993Z ) 2025-05-07T20:33:33.0441254Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0441344Z def test_silu_mul_quant( 2025-05-07T20:33:33.0441418Z self, 2025-05-07T20:33:33.0441494Z T: int, 2025-05-07T20:33:33.0441566Z D: int, 2025-05-07T20:33:33.0441667Z scale_ub: Optional[float], 2025-05-07T20:33:33.0441760Z contiguous: bool, 2025-05-07T20:33:33.0441841Z compiled: bool, 2025-05-07T20:33:33.0441914Z ) -> None: 2025-05-07T20:33:33.0442060Z torch.manual_seed(2025) 2025-05-07T20:33:33.0442131Z 2025-05-07T20:33:33.0442300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0442377Z 2025-05-07T20:33:33.0442465Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0442595Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0442683Z x = x_sign * x_clamp 2025-05-07T20:33:33.0442759Z x0 = x[:, :D] 2025-05-07T20:33:33.0442836Z x1 = x[:, D:] 2025-05-07T20:33:33.0442949Z 2025-05-07T20:33:33.0443028Z if contiguous: 2025-05-07T20:33:33.0443120Z x0 = x0.contiguous() 2025-05-07T20:33:33.0443208Z x1 = x1.contiguous() 2025-05-07T20:33:33.0443279Z 2025-05-07T20:33:33.0443373Z if scale_ub is not None: 2025-05-07T20:33:33.0443475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0443609Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0443687Z ) 2025-05-07T20:33:33.0443758Z else: 2025-05-07T20:33:33.0443857Z scale_ub_tensor = None 2025-05-07T20:33:33.0443928Z 2025-05-07T20:33:33.0444060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0444150Z op = silu_mul_quant 2025-05-07T20:33:33.0444272Z if compiled: 2025-05-07T20:33:33.0444373Z op = torch.compile(op) 2025-05-07T20:33:33.0444476Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0444545Z 2025-05-07T20:33:33.0444638Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0444643Z 2025-05-07T20:33:33.0444742Z moe/activation_test.py:117: 2025-05-07T20:33:33.0444873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0445015Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0445115Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0445510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0445603Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0446146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0446242Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0446625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0446862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0447226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0447319Z kernel = self.compile( 2025-05-07T20:33:33.0447728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0447904Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0448038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0448042Z 2025-05-07T20:33:33.0448251Z self = 2025-05-07T20:33:33.0449131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0449697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fdc6f550>} 2025-05-07T20:33:33.0450508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0450748Z context = 2025-05-07T20:33:33.0450753Z 2025-05-07T20:33:33.0450920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0451199Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0451306Z module_map=module_map) 2025-05-07T20:33:33.0451468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0451565Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0451681Z E ^ 2025-05-07T20:33:33.0452063Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0452068Z 2025-05-07T20:33:33.0452510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0452518Z 2025-05-07T20:33:33.0452617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0452851Z self=, 2025-05-07T20:33:33.0452926Z T=4096, 2025-05-07T20:33:33.0453001Z D=7168, 2025-05-07T20:33:33.0453085Z scale_ub=None, 2025-05-07T20:33:33.0453168Z contiguous=False, 2025-05-07T20:33:33.0453251Z compiled=True, 2025-05-07T20:33:33.0453322Z ) 2025-05-07T20:33:33.0453584Z self = 2025-05-07T20:33:33.0453765Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0453773Z 2025-05-07T20:33:33.0453847Z @given( 2025-05-07T20:33:33.0453964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0454063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0454176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0454351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0454464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0454540Z ) 2025-05-07T20:33:33.0454798Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0454891Z def test_silu_mul_quant( 2025-05-07T20:33:33.0454964Z self, 2025-05-07T20:33:33.0455040Z T: int, 2025-05-07T20:33:33.0455112Z D: int, 2025-05-07T20:33:33.0455213Z scale_ub: Optional[float], 2025-05-07T20:33:33.0455301Z contiguous: bool, 2025-05-07T20:33:33.0455384Z compiled: bool, 2025-05-07T20:33:33.0455458Z ) -> None: 2025-05-07T20:33:33.0455562Z torch.manual_seed(2025) 2025-05-07T20:33:33.0455636Z 2025-05-07T20:33:33.0455803Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0455878Z 2025-05-07T20:33:33.0455972Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0456098Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0456185Z x = x_sign * x_clamp 2025-05-07T20:33:33.0456264Z x0 = x[:, :D] 2025-05-07T20:33:33.0456344Z x1 = x[:, D:] 2025-05-07T20:33:33.0456413Z 2025-05-07T20:33:33.0456493Z if contiguous: 2025-05-07T20:33:33.0456587Z x0 = x0.contiguous() 2025-05-07T20:33:33.0456675Z x1 = x1.contiguous() 2025-05-07T20:33:33.0456746Z 2025-05-07T20:33:33.0456837Z if scale_ub is not None: 2025-05-07T20:33:33.0456941Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0457075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0457155Z ) 2025-05-07T20:33:33.0457231Z else: 2025-05-07T20:33:33.0457327Z scale_ub_tensor = None 2025-05-07T20:33:33.0457396Z 2025-05-07T20:33:33.0457522Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0457614Z op = silu_mul_quant 2025-05-07T20:33:33.0457696Z if compiled: 2025-05-07T20:33:33.0457794Z op = torch.compile(op) 2025-05-07T20:33:33.0457944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0458018Z 2025-05-07T20:33:33.0458107Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0458111Z 2025-05-07T20:33:33.0458206Z moe/activation_test.py:117: 2025-05-07T20:33:33.0458335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0458435Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0458535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0458930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0459096Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0459662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0459759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0460142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0460377Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0460744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0460835Z kernel = self.compile( 2025-05-07T20:33:33.0461281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0461463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0461591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0461596Z 2025-05-07T20:33:33.0461803Z self = 2025-05-07T20:33:33.0462692Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0463232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd965160>} 2025-05-07T20:33:33.0464046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0464240Z context = 2025-05-07T20:33:33.0464247Z 2025-05-07T20:33:33.0464416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0464691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0464798Z module_map=module_map) 2025-05-07T20:33:33.0464962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0465061Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0465135Z E ^ 2025-05-07T20:33:33.0465518Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0465522Z 2025-05-07T20:33:33.0465967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0465972Z 2025-05-07T20:33:33.0466076Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0466307Z self=, 2025-05-07T20:33:33.0466385Z T=16384, 2025-05-07T20:33:33.0466461Z D=5120, 2025-05-07T20:33:33.0466542Z scale_ub=1200.0, 2025-05-07T20:33:33.0466628Z contiguous=False, 2025-05-07T20:33:33.0466716Z compiled=False, 2025-05-07T20:33:33.0466787Z ) 2025-05-07T20:33:33.0467013Z self = 2025-05-07T20:33:33.0467248Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0467253Z 2025-05-07T20:33:33.0467330Z @given( 2025-05-07T20:33:33.0467449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0467544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0467659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0467779Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0467889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0468001Z ) 2025-05-07T20:33:33.0468258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0468349Z def test_silu_mul_quant( 2025-05-07T20:33:33.0468425Z self, 2025-05-07T20:33:33.0468502Z T: int, 2025-05-07T20:33:33.0468576Z D: int, 2025-05-07T20:33:33.0468677Z scale_ub: Optional[float], 2025-05-07T20:33:33.0468761Z contiguous: bool, 2025-05-07T20:33:33.0468847Z compiled: bool, 2025-05-07T20:33:33.0468926Z ) -> None: 2025-05-07T20:33:33.0469019Z torch.manual_seed(2025) 2025-05-07T20:33:33.0469090Z 2025-05-07T20:33:33.0469266Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0469339Z 2025-05-07T20:33:33.0469428Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0469596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0469684Z x = x_sign * x_clamp 2025-05-07T20:33:33.0469839Z x0 = x[:, :D] 2025-05-07T20:33:33.0469921Z x1 = x[:, D:] 2025-05-07T20:33:33.0469991Z 2025-05-07T20:33:33.0470075Z if contiguous: 2025-05-07T20:33:33.0470165Z x0 = x0.contiguous() 2025-05-07T20:33:33.0470255Z x1 = x1.contiguous() 2025-05-07T20:33:33.0470376Z 2025-05-07T20:33:33.0470464Z if scale_ub is not None: 2025-05-07T20:33:33.0470567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0470707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0470780Z ) 2025-05-07T20:33:33.0470855Z else: 2025-05-07T20:33:33.0470951Z scale_ub_tensor = None 2025-05-07T20:33:33.0471022Z 2025-05-07T20:33:33.0471148Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0471244Z op = silu_mul_quant 2025-05-07T20:33:33.0471326Z if compiled: 2025-05-07T20:33:33.0471428Z op = torch.compile(op) 2025-05-07T20:33:33.0471530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0471602Z 2025-05-07T20:33:33.0471692Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0471696Z 2025-05-07T20:33:33.0471791Z moe/activation_test.py:117: 2025-05-07T20:33:33.0471920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0472026Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0472124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0472669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:33.0472765Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0473148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0473385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0473746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0473837Z kernel = self.compile( 2025-05-07T20:33:33.0474247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0474426Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0474556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0474560Z 2025-05-07T20:33:33.0474812Z self = 2025-05-07T20:33:33.0475656Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0476201Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd965940>} 2025-05-07T20:33:33.0477053Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0477256Z context = 2025-05-07T20:33:33.0477261Z 2025-05-07T20:33:33.0477431Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0477707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0477810Z module_map=module_map) 2025-05-07T20:33:33.0477971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0478112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0478190Z E ^ 2025-05-07T20:33:33.0478569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0478576Z 2025-05-07T20:33:33.0479022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0479027Z 2025-05-07T20:33:33.0479170Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0479404Z self=, 2025-05-07T20:33:33.0479483Z T=16384, 2025-05-07T20:33:33.0479560Z D=5120, 2025-05-07T20:33:33.0479646Z scale_ub=1200.0, 2025-05-07T20:33:33.0479731Z contiguous=True, 2025-05-07T20:33:33.0479810Z compiled=True, 2025-05-07T20:33:33.0479888Z ) 2025-05-07T20:33:33.0480110Z self = 2025-05-07T20:33:33.0480292Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:33.0480296Z 2025-05-07T20:33:33.0480376Z @given( 2025-05-07T20:33:33.0480492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0480592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0480707Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0480821Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0480939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0481013Z ) 2025-05-07T20:33:33.0481273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0481372Z def test_silu_mul_quant( 2025-05-07T20:33:33.0481444Z self, 2025-05-07T20:33:33.0481517Z T: int, 2025-05-07T20:33:33.0481595Z D: int, 2025-05-07T20:33:33.0481689Z scale_ub: Optional[float], 2025-05-07T20:33:33.0481775Z contiguous: bool, 2025-05-07T20:33:33.0481864Z compiled: bool, 2025-05-07T20:33:33.0481942Z ) -> None: 2025-05-07T20:33:33.0482039Z torch.manual_seed(2025) 2025-05-07T20:33:33.0482112Z 2025-05-07T20:33:33.0482281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0482360Z 2025-05-07T20:33:33.0482450Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0482572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0482661Z x = x_sign * x_clamp 2025-05-07T20:33:33.0482929Z x0 = x[:, :D] 2025-05-07T20:33:33.0483050Z x1 = x[:, D:] 2025-05-07T20:33:33.0483156Z 2025-05-07T20:33:33.0483239Z if contiguous: 2025-05-07T20:33:33.0483414Z x0 = x0.contiguous() 2025-05-07T20:33:33.0483508Z x1 = x1.contiguous() 2025-05-07T20:33:33.0483580Z 2025-05-07T20:33:33.0483674Z if scale_ub is not None: 2025-05-07T20:33:33.0483782Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0483928Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0484004Z ) 2025-05-07T20:33:33.0484078Z else: 2025-05-07T20:33:33.0484172Z scale_ub_tensor = None 2025-05-07T20:33:33.0484304Z 2025-05-07T20:33:33.0484430Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0484518Z op = silu_mul_quant 2025-05-07T20:33:33.0484602Z if compiled: 2025-05-07T20:33:33.0484698Z op = torch.compile(op) 2025-05-07T20:33:33.0484801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0484871Z 2025-05-07T20:33:33.0484956Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0484961Z 2025-05-07T20:33:33.0485062Z moe/activation_test.py:117: 2025-05-07T20:33:33.0485190Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0485287Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0485383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0485865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0485957Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0486496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0486596Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0486979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0487279Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0487641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0487732Z kernel = self.compile( 2025-05-07T20:33:33.0488140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0488320Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0488451Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0488455Z 2025-05-07T20:33:33.0488663Z self = 2025-05-07T20:33:33.0489561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0490107Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fda0d550>} 2025-05-07T20:33:33.0490918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0491111Z context = 2025-05-07T20:33:33.0491116Z 2025-05-07T20:33:33.0491280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0491559Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0491661Z module_map=module_map) 2025-05-07T20:33:33.0491827Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0491921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0491992Z E ^ 2025-05-07T20:33:33.0492416Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0492421Z 2025-05-07T20:33:33.0492864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0492869Z 2025-05-07T20:33:33.0492970Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0493199Z self=, 2025-05-07T20:33:33.0493275Z T=16384, 2025-05-07T20:33:33.0493347Z D=5120, 2025-05-07T20:33:33.0493464Z scale_ub=None, 2025-05-07T20:33:33.0493548Z contiguous=False, 2025-05-07T20:33:33.0493634Z compiled=True, 2025-05-07T20:33:33.0493704Z ) 2025-05-07T20:33:33.0493926Z self = 2025-05-07T20:33:33.0494110Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0494115Z 2025-05-07T20:33:33.0494188Z @given( 2025-05-07T20:33:33.0494306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0494406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0494517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0494633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0494781Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0494853Z ) 2025-05-07T20:33:33.0495110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0495200Z def test_silu_mul_quant( 2025-05-07T20:33:33.0495272Z self, 2025-05-07T20:33:33.0495348Z T: int, 2025-05-07T20:33:33.0495421Z D: int, 2025-05-07T20:33:33.0495516Z scale_ub: Optional[float], 2025-05-07T20:33:33.0495647Z contiguous: bool, 2025-05-07T20:33:33.0495728Z compiled: bool, 2025-05-07T20:33:33.0495802Z ) -> None: 2025-05-07T20:33:33.0495896Z torch.manual_seed(2025) 2025-05-07T20:33:33.0495964Z 2025-05-07T20:33:33.0496139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0496210Z 2025-05-07T20:33:33.0496298Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0496420Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0496505Z x = x_sign * x_clamp 2025-05-07T20:33:33.0496582Z x0 = x[:, :D] 2025-05-07T20:33:33.0496661Z x1 = x[:, D:] 2025-05-07T20:33:33.0496731Z 2025-05-07T20:33:33.0496812Z if contiguous: 2025-05-07T20:33:33.0496907Z x0 = x0.contiguous() 2025-05-07T20:33:33.0496993Z x1 = x1.contiguous() 2025-05-07T20:33:33.0497063Z 2025-05-07T20:33:33.0497152Z if scale_ub is not None: 2025-05-07T20:33:33.0497253Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0497391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0497463Z ) 2025-05-07T20:33:33.0497532Z else: 2025-05-07T20:33:33.0497625Z scale_ub_tensor = None 2025-05-07T20:33:33.0497697Z 2025-05-07T20:33:33.0497824Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0497915Z op = silu_mul_quant 2025-05-07T20:33:33.0497994Z if compiled: 2025-05-07T20:33:33.0498092Z op = torch.compile(op) 2025-05-07T20:33:33.0498197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0498268Z 2025-05-07T20:33:33.0498354Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0498358Z 2025-05-07T20:33:33.0498459Z moe/activation_test.py:117: 2025-05-07T20:33:33.0498589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0498691Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0498788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0499179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0499318Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0499855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0499948Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0500332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0500565Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0500927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0501054Z kernel = self.compile( 2025-05-07T20:33:33.0501460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0501644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0501775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0501780Z 2025-05-07T20:33:33.0501993Z self = 2025-05-07T20:33:33.0502881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0503424Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd7b71f0>} 2025-05-07T20:33:33.0504236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0504469Z context = 2025-05-07T20:33:33.0504473Z 2025-05-07T20:33:33.0504642Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0504914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0505016Z module_map=module_map) 2025-05-07T20:33:33.0505179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0505274Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0505347Z E ^ 2025-05-07T20:33:33.0505726Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0505733Z 2025-05-07T20:33:33.0506176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0506184Z 2025-05-07T20:33:33.0506285Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0506514Z self=, 2025-05-07T20:33:33.0506586Z T=2048, 2025-05-07T20:33:33.0506664Z D=5120, 2025-05-07T20:33:33.0506742Z scale_ub=None, 2025-05-07T20:33:33.0506830Z contiguous=False, 2025-05-07T20:33:33.0506907Z compiled=True, 2025-05-07T20:33:33.0506975Z ) 2025-05-07T20:33:33.0507198Z self = 2025-05-07T20:33:33.0507379Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:33.0507383Z 2025-05-07T20:33:33.0507455Z @given( 2025-05-07T20:33:33.0507580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0507676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0507785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0507900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0508010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0508085Z ) 2025-05-07T20:33:33.0508385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0508478Z def test_silu_mul_quant( 2025-05-07T20:33:33.0508557Z self, 2025-05-07T20:33:33.0508628Z T: int, 2025-05-07T20:33:33.0508700Z D: int, 2025-05-07T20:33:33.0508800Z scale_ub: Optional[float], 2025-05-07T20:33:33.0508881Z contiguous: bool, 2025-05-07T20:33:33.0508968Z compiled: bool, 2025-05-07T20:33:33.0509048Z ) -> None: 2025-05-07T20:33:33.0509139Z torch.manual_seed(2025) 2025-05-07T20:33:33.0509209Z 2025-05-07T20:33:33.0513071Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0513162Z 2025-05-07T20:33:33.0513259Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0513393Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0513486Z x = x_sign * x_clamp 2025-05-07T20:33:33.0513570Z x0 = x[:, :D] 2025-05-07T20:33:33.0513650Z x1 = x[:, D:] 2025-05-07T20:33:33.0513724Z 2025-05-07T20:33:33.0513813Z if contiguous: 2025-05-07T20:33:33.0513905Z x0 = x0.contiguous() 2025-05-07T20:33:33.0513995Z x1 = x1.contiguous() 2025-05-07T20:33:33.0514075Z 2025-05-07T20:33:33.0514165Z if scale_ub is not None: 2025-05-07T20:33:33.0514273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0514478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0514556Z ) 2025-05-07T20:33:33.0514633Z else: 2025-05-07T20:33:33.0514736Z scale_ub_tensor = None 2025-05-07T20:33:33.0514810Z 2025-05-07T20:33:33.0514944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0515040Z op = silu_mul_quant 2025-05-07T20:33:33.0515127Z if compiled: 2025-05-07T20:33:33.0515276Z op = torch.compile(op) 2025-05-07T20:33:33.0515381Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0515454Z 2025-05-07T20:33:33.0515550Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0515555Z 2025-05-07T20:33:33.0515652Z moe/activation_test.py:117: 2025-05-07T20:33:33.0515786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0515896Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0516002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0516409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0516510Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0517053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0517153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0517539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0517775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0518146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0518238Z kernel = self.compile( 2025-05-07T20:33:33.0518655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0518862Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0519017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0519025Z 2025-05-07T20:33:33.0519244Z self = 2025-05-07T20:33:33.0520097Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0520698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd7b7f70>} 2025-05-07T20:33:33.0521521Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0521719Z context = 2025-05-07T20:33:33.0521723Z 2025-05-07T20:33:33.0521962Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0522241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0522353Z module_map=module_map) 2025-05-07T20:33:33.0522521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0522618Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0522700Z E ^ 2025-05-07T20:33:33.0523091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0523096Z 2025-05-07T20:33:33.0523548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0523552Z 2025-05-07T20:33:33.0523706Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0523939Z self=, 2025-05-07T20:33:33.0524025Z T=2048, 2025-05-07T20:33:33.0524103Z D=5120, 2025-05-07T20:33:33.0524188Z scale_ub=1200.0, 2025-05-07T20:33:33.0524281Z contiguous=False, 2025-05-07T20:33:33.0524365Z compiled=True, 2025-05-07T20:33:33.0524439Z ) 2025-05-07T20:33:33.0524714Z self = 2025-05-07T20:33:33.0524897Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0524901Z 2025-05-07T20:33:33.0524984Z @given( 2025-05-07T20:33:33.0525103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0525200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0525319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0525439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0525552Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0525631Z ) 2025-05-07T20:33:33.0525894Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0525991Z def test_silu_mul_quant( 2025-05-07T20:33:33.0526069Z self, 2025-05-07T20:33:33.0526148Z T: int, 2025-05-07T20:33:33.0526224Z D: int, 2025-05-07T20:33:33.0526326Z scale_ub: Optional[float], 2025-05-07T20:33:33.0526421Z contiguous: bool, 2025-05-07T20:33:33.0526506Z compiled: bool, 2025-05-07T20:33:33.0526586Z ) -> None: 2025-05-07T20:33:33.0526688Z torch.manual_seed(2025) 2025-05-07T20:33:33.0526766Z 2025-05-07T20:33:33.0526941Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0527019Z 2025-05-07T20:33:33.0527112Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0527237Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0527331Z x = x_sign * x_clamp 2025-05-07T20:33:33.0527412Z x0 = x[:, :D] 2025-05-07T20:33:33.0527498Z x1 = x[:, D:] 2025-05-07T20:33:33.0527571Z 2025-05-07T20:33:33.0527656Z if contiguous: 2025-05-07T20:33:33.0527753Z x0 = x0.contiguous() 2025-05-07T20:33:33.0527842Z x1 = x1.contiguous() 2025-05-07T20:33:33.0527916Z 2025-05-07T20:33:33.0528015Z if scale_ub is not None: 2025-05-07T20:33:33.0528125Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0528261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0528342Z ) 2025-05-07T20:33:33.0528465Z else: 2025-05-07T20:33:33.0528561Z scale_ub_tensor = None 2025-05-07T20:33:33.0528635Z 2025-05-07T20:33:33.0528766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0528858Z op = silu_mul_quant 2025-05-07T20:33:33.0528943Z if compiled: 2025-05-07T20:33:33.0529067Z op = torch.compile(op) 2025-05-07T20:33:33.0529185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0529275Z 2025-05-07T20:33:33.0529366Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0529411Z 2025-05-07T20:33:33.0529513Z moe/activation_test.py:117: 2025-05-07T20:33:33.0529647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0529752Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0529856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0530255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0530353Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0530893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0530989Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0531414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0531651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0532019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0532114Z kernel = self.compile( 2025-05-07T20:33:33.0532526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0532748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0532881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0532886Z 2025-05-07T20:33:33.0533098Z self = 2025-05-07T20:33:33.0533951Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0534496Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd738940>} 2025-05-07T20:33:33.0535316Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0535515Z context = 2025-05-07T20:33:33.0535521Z 2025-05-07T20:33:33.0535694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0535971Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0536077Z module_map=module_map) 2025-05-07T20:33:33.0536246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0536343Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0536420Z E ^ 2025-05-07T20:33:33.0536810Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0536815Z 2025-05-07T20:33:33.0537263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0537271Z 2025-05-07T20:33:33.0537379Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0537652Z self=, 2025-05-07T20:33:33.0537730Z T=4096, 2025-05-07T20:33:33.0537811Z D=5120, 2025-05-07T20:33:33.0537894Z scale_ub=1200.0, 2025-05-07T20:33:33.0537979Z contiguous=True, 2025-05-07T20:33:33.0538067Z compiled=True, 2025-05-07T20:33:33.0538144Z ) 2025-05-07T20:33:33.0538372Z self = 2025-05-07T20:33:33.0538554Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:33.0538559Z 2025-05-07T20:33:33.0538680Z @given( 2025-05-07T20:33:33.0538807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0538907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0539047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0539191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0539316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0539393Z ) 2025-05-07T20:33:33.0539663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0539758Z def test_silu_mul_quant( 2025-05-07T20:33:33.0539836Z self, 2025-05-07T20:33:33.0539915Z T: int, 2025-05-07T20:33:33.0539990Z D: int, 2025-05-07T20:33:33.0540134Z scale_ub: Optional[float], 2025-05-07T20:33:33.0540225Z contiguous: bool, 2025-05-07T20:33:33.0540310Z compiled: bool, 2025-05-07T20:33:33.0540392Z ) -> None: 2025-05-07T20:33:33.0540486Z torch.manual_seed(2025) 2025-05-07T20:33:33.0540564Z 2025-05-07T20:33:33.0540739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0540813Z 2025-05-07T20:33:33.0540906Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0541082Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0541172Z x = x_sign * x_clamp 2025-05-07T20:33:33.0541253Z x0 = x[:, :D] 2025-05-07T20:33:33.0541337Z x1 = x[:, D:] 2025-05-07T20:33:33.0541411Z 2025-05-07T20:33:33.0541497Z if contiguous: 2025-05-07T20:33:33.0541588Z x0 = x0.contiguous() 2025-05-07T20:33:33.0541678Z x1 = x1.contiguous() 2025-05-07T20:33:33.0541755Z 2025-05-07T20:33:33.0541849Z if scale_ub is not None: 2025-05-07T20:33:33.0541958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0542098Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0542174Z ) 2025-05-07T20:33:33.0542253Z else: 2025-05-07T20:33:33.0542355Z scale_ub_tensor = None 2025-05-07T20:33:33.0542432Z 2025-05-07T20:33:33.0542563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0542656Z op = silu_mul_quant 2025-05-07T20:33:33.0542745Z if compiled: 2025-05-07T20:33:33.0542847Z op = torch.compile(op) 2025-05-07T20:33:33.0542958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0543033Z 2025-05-07T20:33:33.0543128Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0543133Z 2025-05-07T20:33:33.0543231Z moe/activation_test.py:117: 2025-05-07T20:33:33.0543364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0543471Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0543573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0543969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0544068Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0544607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0544713Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0545096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0545381Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0545750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0545844Z kernel = self.compile( 2025-05-07T20:33:33.0546259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0546443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0546575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0546619Z 2025-05-07T20:33:33.0546832Z self = 2025-05-07T20:33:33.0547681Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0548235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd705790>} 2025-05-07T20:33:33.0549093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0549291Z context = 2025-05-07T20:33:33.0549297Z 2025-05-07T20:33:33.0549470Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0549747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0549975Z module_map=module_map) 2025-05-07T20:33:33.0550137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0550238Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0550321Z E ^ 2025-05-07T20:33:33.0550705Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0550709Z 2025-05-07T20:33:33.0551159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0551166Z 2025-05-07T20:33:33.0551268Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0551500Z self=, 2025-05-07T20:33:33.0551583Z T=128, 2025-05-07T20:33:33.0551660Z D=5120, 2025-05-07T20:33:33.0551742Z scale_ub=1200.0, 2025-05-07T20:33:33.0551830Z contiguous=False, 2025-05-07T20:33:33.0551917Z compiled=True, 2025-05-07T20:33:33.0551991Z ) 2025-05-07T20:33:33.0552219Z self = 2025-05-07T20:33:33.0552399Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.0552404Z 2025-05-07T20:33:33.0552484Z @given( 2025-05-07T20:33:33.0552606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0552706Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0552825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0552947Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0553060Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0553137Z ) 2025-05-07T20:33:33.0553399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0553493Z def test_silu_mul_quant( 2025-05-07T20:33:33.0553572Z self, 2025-05-07T20:33:33.0553647Z T: int, 2025-05-07T20:33:33.0553726Z D: int, 2025-05-07T20:33:33.0553826Z scale_ub: Optional[float], 2025-05-07T20:33:33.0553915Z contiguous: bool, 2025-05-07T20:33:33.0554002Z compiled: bool, 2025-05-07T20:33:33.0554154Z ) -> None: 2025-05-07T20:33:33.0554248Z torch.manual_seed(2025) 2025-05-07T20:33:33.0554324Z 2025-05-07T20:33:33.0554495Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0554569Z 2025-05-07T20:33:33.0554662Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0554791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0554882Z x = x_sign * x_clamp 2025-05-07T20:33:33.0554969Z x0 = x[:, :D] 2025-05-07T20:33:33.0555048Z x1 = x[:, D:] 2025-05-07T20:33:33.0555162Z 2025-05-07T20:33:33.0555248Z if contiguous: 2025-05-07T20:33:33.0555340Z x0 = x0.contiguous() 2025-05-07T20:33:33.0555432Z x1 = x1.contiguous() 2025-05-07T20:33:33.0555510Z 2025-05-07T20:33:33.0555603Z if scale_ub is not None: 2025-05-07T20:33:33.0555712Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0555848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0555925Z ) 2025-05-07T20:33:33.0556003Z else: 2025-05-07T20:33:33.0556099Z scale_ub_tensor = None 2025-05-07T20:33:33.0556172Z 2025-05-07T20:33:33.0556304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0556394Z op = silu_mul_quant 2025-05-07T20:33:33.0556517Z if compiled: 2025-05-07T20:33:33.0556622Z op = torch.compile(op) 2025-05-07T20:33:33.0556727Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0556802Z 2025-05-07T20:33:33.0556896Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0556900Z 2025-05-07T20:33:33.0556998Z moe/activation_test.py:117: 2025-05-07T20:33:33.0557134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0557277Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0557377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0557782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0557876Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.0558419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0558526Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0558921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0559163Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0559530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0559625Z kernel = self.compile( 2025-05-07T20:33:33.0560042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0560223Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0560360Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0560365Z 2025-05-07T20:33:33.0560575Z self = 2025-05-07T20:33:33.0561426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0561981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd5cc0d0>} 2025-05-07T20:33:33.0562795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0563039Z context = 2025-05-07T20:33:33.0563044Z 2025-05-07T20:33:33.0563216Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0563492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0563603Z module_map=module_map) 2025-05-07T20:33:33.0563765Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0563865Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0563988Z E ^ 2025-05-07T20:33:33.0564373Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0564378Z 2025-05-07T20:33:33.0564832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0564837Z 2025-05-07T20:33:33.0564939Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0565177Z self=, 2025-05-07T20:33:33.0565256Z T=16384, 2025-05-07T20:33:33.0565334Z D=7168, 2025-05-07T20:33:33.0565421Z scale_ub=1200.0, 2025-05-07T20:33:33.0565511Z contiguous=True, 2025-05-07T20:33:33.0565595Z compiled=True, 2025-05-07T20:33:33.0565712Z ) 2025-05-07T20:33:33.0565943Z self = 2025-05-07T20:33:33.0566123Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:33.0566130Z 2025-05-07T20:33:33.0566209Z @given( 2025-05-07T20:33:33.0566330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0566431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0566589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0566705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0566824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0566899Z ) 2025-05-07T20:33:33.0567161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0567260Z def test_silu_mul_quant( 2025-05-07T20:33:33.0567336Z self, 2025-05-07T20:33:33.0567413Z T: int, 2025-05-07T20:33:33.0567496Z D: int, 2025-05-07T20:33:33.0567598Z scale_ub: Optional[float], 2025-05-07T20:33:33.0567687Z contiguous: bool, 2025-05-07T20:33:33.0567775Z compiled: bool, 2025-05-07T20:33:33.0567856Z ) -> None: 2025-05-07T20:33:33.0567955Z torch.manual_seed(2025) 2025-05-07T20:33:33.0568028Z 2025-05-07T20:33:33.0568200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0568280Z 2025-05-07T20:33:33.0568371Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0568496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0568590Z x = x_sign * x_clamp 2025-05-07T20:33:33.0568673Z x0 = x[:, :D] 2025-05-07T20:33:33.0568753Z x1 = x[:, D:] 2025-05-07T20:33:33.0568832Z 2025-05-07T20:33:33.0568916Z if contiguous: 2025-05-07T20:33:33.0569030Z x0 = x0.contiguous() 2025-05-07T20:33:33.0569132Z x1 = x1.contiguous() 2025-05-07T20:33:33.0569226Z 2025-05-07T20:33:33.0569321Z if scale_ub is not None: 2025-05-07T20:33:33.0569431Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0569567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0569650Z ) 2025-05-07T20:33:33.0569725Z else: 2025-05-07T20:33:33.0569817Z scale_ub_tensor = None 2025-05-07T20:33:33.0569894Z 2025-05-07T20:33:33.0570024Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0570115Z op = silu_mul_quant 2025-05-07T20:33:33.0570206Z if compiled: 2025-05-07T20:33:33.0570306Z op = torch.compile(op) 2025-05-07T20:33:33.0570460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0570535Z 2025-05-07T20:33:33.0570624Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0570628Z 2025-05-07T20:33:33.0570729Z moe/activation_test.py:117: 2025-05-07T20:33:33.0570860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0570962Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0571062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0571452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.0571583Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
(then the identical silu_mul_quant -> Triton compile traceback and CompilationError as above)

The next examples failed identically, differing only in the sampled parameters:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError (fp8e4nv)
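For context on what keeps failing to compile: judging from the test body, silu_mul_quant fuses a SwiGLU-style gated activation, silu(x0) * x1, with dynamic rowwise quantization to FP8 E4M3 (Triton's fp8e4nv), returning the quantized tensor and per-row scales. A rough eager-mode sketch of those semantics is below; the function name and the exact scaling scheme (rowwise absmax, scale capped by scale_ub) are assumptions for illustration, and the real fbgemm_gpu kernel may differ:

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SwiGLU-style gating, computed in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Dynamic rowwise scale, optionally capped by scale_ub
        # (mirroring the scale_ub_tensor the test passes in).
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / y_scale).clamp_(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)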
As accumulated allocations filled the device, later examples began failing earlier, in the test's setup code, with CUDA OOM before ever reaching the kernel:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)):
     tried to allocate 320.00 MiB; GPU 0 has 22.07 GiB total, 140.44 MiB free, 21.92 GiB in use
     (21.60 GiB allocated by PyTorch, 45.02 MiB reserved but unallocated)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 112.00 MiB; 28.44 MiB free, 22.03 GiB in use
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 448.00 MiB; 140.44 MiB free, 21.92 GiB in use
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 56.00 MiB; 28.44 MiB free, 22.03 GiB in use
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB; 28.44 MiB free, 22.03 GiB in use

(Each OOM message carries the allocator's standard advice: if reserved-but-unallocated memory is large, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation; see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables.)
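These OOMs look like a secondary effect of the run, not of any single example: the allocator reports roughly 21.9 to 22.0 GiB already in use on a 22.07 GiB device while the test asks for only 56 to 448 MiB, i.e. memory accumulated across the many preceding Hypothesis examples. Beyond the allocator's own expandable_segments suggestion, a common mitigation is to release cached blocks between examples; a minimal sketch, with the helper name and its placement in the test being hypothetical:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references left over from the previous example,
        # then return the caching allocator's blocks to the driver so a
        # fresh large allocation can succeed.
        gc.collect()
        torch.cuda.empty_cache()

    # Hypothetical usage at the top of test_silu_mul_quant:
    #     release_cuda_memory()
    #     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)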
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:33.0677094Z 2025-05-07T20:33:33.0677210Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:33.0677218Z 2025-05-07T20:33:33.0677320Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0677545Z self=, 2025-05-07T20:33:33.0677617Z T=1, 2025-05-07T20:33:33.0677693Z D=7168, 2025-05-07T20:33:33.0677770Z scale_ub=1200.0, 2025-05-07T20:33:33.0677892Z contiguous=True, 2025-05-07T20:33:33.0677977Z compiled=False, 2025-05-07T20:33:33.0678045Z ) 2025-05-07T20:33:33.0678271Z self = 2025-05-07T20:33:33.0678441Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:33.0678445Z 2025-05-07T20:33:33.0678517Z @given( 2025-05-07T20:33:33.0678640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0678785Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0678916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0679049Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0679177Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0679251Z ) 2025-05-07T20:33:33.0679506Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0679599Z def test_silu_mul_quant( 2025-05-07T20:33:33.0679675Z self, 2025-05-07T20:33:33.0679753Z T: int, 2025-05-07T20:33:33.0679827Z D: int, 2025-05-07T20:33:33.0679926Z scale_ub: Optional[float], 2025-05-07T20:33:33.0680014Z contiguous: bool, 2025-05-07T20:33:33.0680100Z compiled: bool, 2025-05-07T20:33:33.0680179Z ) -> None: 2025-05-07T20:33:33.0680270Z torch.manual_seed(2025) 2025-05-07T20:33:33.0680344Z 2025-05-07T20:33:33.0680515Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0680592Z 2025-05-07T20:33:33.0680683Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0680807Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0680894Z x = x_sign * x_clamp 2025-05-07T20:33:33.0680974Z x0 = x[:, :D] 2025-05-07T20:33:33.0681056Z x1 = x[:, D:] 2025-05-07T20:33:33.0681128Z 2025-05-07T20:33:33.0681210Z if contiguous: 2025-05-07T20:33:33.0681300Z x0 = x0.contiguous() 2025-05-07T20:33:33.0681393Z x1 = x1.contiguous() 2025-05-07T20:33:33.0681473Z 2025-05-07T20:33:33.0681564Z if scale_ub is not None: 2025-05-07T20:33:33.0681668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0681808Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0681881Z ) 2025-05-07T20:33:33.0681956Z else: 2025-05-07T20:33:33.0682049Z scale_ub_tensor = None 2025-05-07T20:33:33.0682120Z 2025-05-07T20:33:33.0682251Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0682339Z op = silu_mul_quant 2025-05-07T20:33:33.0682424Z if compiled: 2025-05-07T20:33:33.0682574Z op = torch.compile(op) 2025-05-07T20:33:33.0682679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0683060Z 2025-05-07T20:33:33.0683200Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0683207Z 2025-05-07T20:33:33.0683310Z moe/activation_test.py:117: 2025-05-07T20:33:33.0683442Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0683546Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0683642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0684281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0684374Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0684759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0684999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0685358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0685448Z kernel = self.compile( 2025-05-07T20:33:33.0685922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0686099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0686231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0686239Z 2025-05-07T20:33:33.0686446Z self = 2025-05-07T20:33:33.0687291Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0687905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd197040>} 2025-05-07T20:33:33.0688724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0688922Z context = 2025-05-07T20:33:33.0688927Z 2025-05-07T20:33:33.0689098Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0689375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0689478Z module_map=module_map) 2025-05-07T20:33:33.0689641Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0689745Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0689816Z E ^ 2025-05-07T20:33:33.0690197Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0690201Z 2025-05-07T20:33:33.0690646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0690651Z 2025-05-07T20:33:33.0690753Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0690981Z self=, 2025-05-07T20:33:33.0691057Z T=128, 2025-05-07T20:33:33.0691126Z D=5120, 2025-05-07T20:33:33.0691206Z scale_ub=None, 2025-05-07T20:33:33.0691285Z contiguous=True, 2025-05-07T20:33:33.0691364Z compiled=False, 2025-05-07T20:33:33.0691437Z ) 2025-05-07T20:33:33.0691662Z self = 2025-05-07T20:33:33.0691834Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:33.0691844Z 2025-05-07T20:33:33.0691978Z @given( 2025-05-07T20:33:33.0692095Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0692193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0692305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0692418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0692533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0692602Z ) 2025-05-07T20:33:33.0692857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0692992Z def test_silu_mul_quant( 2025-05-07T20:33:33.0693065Z self, 2025-05-07T20:33:33.0693135Z T: int, 2025-05-07T20:33:33.0693210Z D: int, 2025-05-07T20:33:33.0693311Z scale_ub: Optional[float], 2025-05-07T20:33:33.0693396Z contiguous: bool, 2025-05-07T20:33:33.0693476Z compiled: bool, 2025-05-07T20:33:33.0693551Z ) -> None: 2025-05-07T20:33:33.0693645Z torch.manual_seed(2025) 2025-05-07T20:33:33.0693715Z 2025-05-07T20:33:33.0693884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0693957Z 2025-05-07T20:33:33.0694045Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0694166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0694298Z x = x_sign * x_clamp 2025-05-07T20:33:33.0694376Z x0 = x[:, :D] 2025-05-07T20:33:33.0694452Z x1 = x[:, D:] 2025-05-07T20:33:33.0694521Z 2025-05-07T20:33:33.0694604Z if contiguous: 2025-05-07T20:33:33.0694695Z x0 = x0.contiguous() 2025-05-07T20:33:33.0694782Z x1 = x1.contiguous() 2025-05-07T20:33:33.0694850Z 2025-05-07T20:33:33.0694940Z if scale_ub is not None: 2025-05-07T20:33:33.0695084Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0695220Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0695297Z ) 2025-05-07T20:33:33.0695371Z else: 2025-05-07T20:33:33.0695460Z scale_ub_tensor = None 2025-05-07T20:33:33.0695535Z 2025-05-07T20:33:33.0695661Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0695746Z op = silu_mul_quant 2025-05-07T20:33:33.0695830Z if compiled: 2025-05-07T20:33:33.0695928Z op = torch.compile(op) 2025-05-07T20:33:33.0696036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0696105Z 2025-05-07T20:33:33.0696193Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0696203Z 2025-05-07T20:33:33.0696299Z moe/activation_test.py:117: 2025-05-07T20:33:33.0696426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0696524Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0696625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0697168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0697260Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0697644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0697883Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0698247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0698335Z kernel = self.compile( 2025-05-07T20:33:33.0698743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0698932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0699077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0699082Z 2025-05-07T20:33:33.0699320Z self = 2025-05-07T20:33:33.0700212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0700758Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd197a60>} 2025-05-07T20:33:33.0701575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0701830Z context = 2025-05-07T20:33:33.0701836Z 2025-05-07T20:33:33.0702006Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0702280Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0702383Z module_map=module_map) 2025-05-07T20:33:33.0702548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0702641Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0702712Z E ^ 2025-05-07T20:33:33.0703130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0703135Z 2025-05-07T20:33:33.0703585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.0703592Z 2025-05-07T20:33:33.0703699Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.0703928Z self=, 2025-05-07T20:33:33.0704043Z T=128, 2025-05-07T20:33:33.0704121Z D=7168, 2025-05-07T20:33:33.0704198Z scale_ub=None, 2025-05-07T20:33:33.0704282Z contiguous=True, 2025-05-07T20:33:33.0704365Z compiled=False, 2025-05-07T20:33:33.0704436Z ) 2025-05-07T20:33:33.0704658Z self = 2025-05-07T20:33:33.0704831Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:33.0704836Z 2025-05-07T20:33:33.0704912Z @given( 2025-05-07T20:33:33.0705028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0705125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0705240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0705353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0705461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0705537Z ) 2025-05-07T20:33:33.0705794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0705882Z def test_silu_mul_quant( 2025-05-07T20:33:33.0705953Z self, 2025-05-07T20:33:33.0706033Z T: int, 2025-05-07T20:33:33.0706107Z D: int, 2025-05-07T20:33:33.0706200Z scale_ub: Optional[float], 2025-05-07T20:33:33.0706287Z contiguous: bool, 2025-05-07T20:33:33.0706368Z compiled: bool, 2025-05-07T20:33:33.0706444Z ) -> None: 2025-05-07T20:33:33.0706542Z torch.manual_seed(2025) 2025-05-07T20:33:33.0706612Z 2025-05-07T20:33:33.0706781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0706850Z 2025-05-07T20:33:33.0706940Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0707061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0707147Z x = x_sign * x_clamp 2025-05-07T20:33:33.0707224Z x0 = x[:, :D] 2025-05-07T20:33:33.0707305Z x1 = x[:, D:] 2025-05-07T20:33:33.0707373Z 2025-05-07T20:33:33.0707451Z if contiguous: 2025-05-07T20:33:33.0707542Z x0 = x0.contiguous() 2025-05-07T20:33:33.0707673Z x1 = x1.contiguous() 2025-05-07T20:33:33.0707746Z 2025-05-07T20:33:33.0707839Z if scale_ub is not None: 2025-05-07T20:33:33.0707940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0708074Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0708149Z ) 2025-05-07T20:33:33.0708219Z else: 2025-05-07T20:33:33.0708312Z scale_ub_tensor = None 2025-05-07T20:33:33.0708381Z 2025-05-07T20:33:33.0708507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.0708637Z op = silu_mul_quant 2025-05-07T20:33:33.0708717Z if compiled: 2025-05-07T20:33:33.0708811Z op = torch.compile(op) 2025-05-07T20:33:33.0708919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0708994Z 2025-05-07T20:33:33.0709080Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.0709089Z 2025-05-07T20:33:33.0709182Z moe/activation_test.py:117: 2025-05-07T20:33:33.0709313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0709415Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.0709509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.0710146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.0710246Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.0710628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.0710863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.0711224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.0711355Z kernel = self.compile( 2025-05-07T20:33:33.0711766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.0711943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.0712072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.0712076Z 2025-05-07T20:33:33.0712286Z self = 2025-05-07T20:33:33.0713133Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.0713678Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f58fd153790>} 2025-05-07T20:33:33.0714490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.0714685Z context = 2025-05-07T20:33:33.0714690Z 2025-05-07T20:33:33.0714854Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.0715128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.0715235Z module_map=module_map) 2025-05-07T20:33:33.0715396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.0715493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.0715566Z E ^ 2025-05-07T20:33:33.0715944Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.0715952Z 2025-05-07T20:33:33.0716398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
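This CompilationError is an architecture mismatch rather than a code bug: Triton's fp8e4nv (FP8 E4M3) type is only lowered on NVIDIA GPUs of compute capability 8.9 or newer, while the A10G on a linux.g5.4xlarge runner reports 8.6, which is why only fp8e4b15 and fp8e5 are offered. A minimal guard sketch, assuming the suite would rather skip than fail on such GPUs; the helper name supports_fp8e4nv is hypothetical, not part of activation_test.py:

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical guard: Triton only lowers fp8e4nv on compute
        # capability >= (8, 9); the A10G in this job reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch inside the test module:
    #   @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")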
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Fifteen further examples failed with the same torch.OutOfMemoryError; the test source Hypothesis reprints for each is identical to the listing above, so only the distinguishing parameters, requested allocation, and failing line are kept:

  T=2048   D=5120  scale_ub=None    contiguous=True   compiled=False  ->  40.00 MiB  moe/activation_test.py:94 (x_sign = torch.sign(x))
  T=16384  D=5120  scale_ub=None    contiguous=True   compiled=False  -> 320.00 MiB  moe/activation_test.py:92
  T=4096   D=5120  scale_ub=None    contiguous=True   compiled=False  ->  80.00 MiB  moe/activation_test.py:92
  T=2048   D=5120  scale_ub=None    contiguous=False  compiled=False  ->  40.00 MiB  moe/activation_test.py:92
  T=4096   D=7168  scale_ub=None    contiguous=True   compiled=True   -> 112.00 MiB  moe/activation_test.py:92
  T=2048   D=5120  scale_ub=1200.0  contiguous=False  compiled=False  ->  40.00 MiB  moe/activation_test.py:92
  T=4096   D=7168  scale_ub=1200.0  contiguous=True   compiled=False  -> 112.00 MiB  moe/activation_test.py:92
  T=16384  D=7168  scale_ub=None    contiguous=False  compiled=True   -> 448.00 MiB  moe/activation_test.py:92
  T=4096   D=7168  scale_ub=None    contiguous=True   compiled=False  -> 112.00 MiB  moe/activation_test.py:92
  T=16384  D=7168  scale_ub=None    contiguous=True   compiled=False  -> 448.00 MiB  moe/activation_test.py:92
  T=16384  D=7168  scale_ub=1200.0  contiguous=True   compiled=False  -> 448.00 MiB  moe/activation_test.py:92
  T=2048   D=7168  scale_ub=None    contiguous=False  compiled=False  ->  56.00 MiB  moe/activation_test.py:92
  T=128    D=7168  scale_ub=1200.0  contiguous=True   compiled=False  ->  20.00 MiB  moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
  T=128    D=5120  scale_ub=1200.0  contiguous=True   compiled=True   ->  20.00 MiB  moe/activation_test.py:95
  T=128    D=7168  scale_ub=None    contiguous=True   compiled=True   ->  20.00 MiB  moe/activation_test.py:92

(Free device memory declined across the run: the early failures report 26.44 MiB free of the 22.07 GiB total, while the final examples report only 4.44 MiB free, so even 20.00 MiB requests failed.)
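The memory errors compound across Hypothesis examples: each failed example leaves its bfloat16 inputs alive, and the error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation. A cleanup sketch under those assumptions; _free_cuda_memory is a hypothetical helper, not part of activation_test.py, and because Hypothesis re-runs the test body once per example, calling it on entry would bound the accumulation:

    import gc
    import os
    import torch

    # Allocator hint from the error message; it only takes effect if set
    # before the process's first CUDA allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    def _free_cuda_memory() -> None:
        # Drop dead Python references, then return cached blocks to the
        # driver so the next example's torch.randn([T, 2 * D]) can fit.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # Usage sketch: call _free_cuda_memory() as the first statement of
    # test_silu_mul_quant's body.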
Three further examples reached the Triton kernel and failed exactly as the first example above (the repeated traceback through fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, triton/runtime/jit.py, and triton/compiler/compiler.py is omitted):

  T=1    D=5120  scale_ub=1200.0  contiguous=True   compiled=False
  T=128  D=5120  scale_ub=1200.0  contiguous=False  compiled=False
  T=128  D=7168  scale_ub=1200.0  contiguous=True   compiled=True   (entered via torch/_dynamo/eval_frame.py:678 under torch.compile)

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:33.0847552Z =============================== warnings summary =============================== 2025-05-07T20:33:33.0847878Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:33.0848192Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:33.0848542Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:33.0849504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:33.0849748Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:33.0849752Z 2025-05-07T20:33:33.0849974Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:33.0850145Z ================= 1 failed, 1 deselected, 3 warnings in 19.38s ================= 2025-05-07T20:33:34.6443316Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:34.7073402Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:34.7073735Z 2025-05-07T20:33:34.7073971Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:34.7074920Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:34.7075354Z 2025-05-07T20:33:34.7075358Z 2025-05-07T20:33:34.7075368Z 2025-05-07T20:33:34.7092144Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:34.7173739Z Post job cleanup. 2025-05-07T20:33:34.8164868Z [command]/usr/bin/git version 2025-05-07T20:33:34.8207865Z git version 2.47.1 2025-05-07T20:33:34.8247542Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/1cb924ec-5521-4663-8170-b1ccd2c7d762/.gitconfig' 2025-05-07T20:33:34.8260072Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/1cb924ec-5521-4663-8170-b1ccd2c7d762' before making global git config changes 2025-05-07T20:33:34.8261117Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:34.8265795Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:34.8311698Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:34.8345945Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:34.8679958Z Entering 'external/asmjit' 2025-05-07T20:33:34.8751287Z Entering 'external/composable_kernel' 2025-05-07T20:33:34.8825041Z Entering 'external/cpuinfo' 2025-05-07T20:33:34.8891703Z Entering 'external/cutlass' 2025-05-07T20:33:34.8966828Z Entering 'external/googletest' 2025-05-07T20:33:34.9033797Z Entering 'external/hipify_torch' 2025-05-07T20:33:34.9099985Z Entering 'external/json' 2025-05-07T20:33:34.9186304Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:34.9208870Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9220062Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:34.9251964Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:34.9579114Z Entering 'external/asmjit' 2025-05-07T20:33:34.9622722Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9666308Z Entering 'external/composable_kernel' 2025-05-07T20:33:34.9710365Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9759864Z Entering 'external/cpuinfo' 2025-05-07T20:33:34.9802260Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9844867Z Entering 'external/cutlass' 2025-05-07T20:33:34.9887459Z http.https://github.com/.extraheader 2025-05-07T20:33:34.9940096Z 
Entering 'external/googletest' 2025-05-07T20:33:34.9981839Z http.https://github.com/.extraheader 2025-05-07T20:33:35.0025346Z Entering 'external/hipify_torch' 2025-05-07T20:33:35.0067474Z http.https://github.com/.extraheader 2025-05-07T20:33:35.0110953Z Entering 'external/json' 2025-05-07T20:33:35.0152969Z http.https://github.com/.extraheader 2025-05-07T20:33:35.0301882Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:35.0333337Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:35.0343811Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:35.0344181Z ##[endgroup] 2025-05-07T20:33:35.0444695Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:45.8670401Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:34:02.2153500Z Cleaning up orphan processes
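For local triage, the failing invocation recorded above can be replayed through pytest's Python entry point. This is a sketch: the -k filter and the working directory (the gen_ai test tree containing moe/activation_test.py) are assumptions, not part of the CI command:

    import pytest

    # Re-run only test_silu_mul_quant with the flags the CI job used.
    pytest.main([
        "-v", "-rsx", "-s",
        "-W", "ignore::pytest.PytestCollectionWarning",
        "-k", "test_silu_mul_quant",
        "./moe/activation_test.py",
    ])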